Preface to Second Edition


In the ten years that have elapsed since this book was first 
published, certain additional statistical methods of particular 
value to research workers have been developed. These methods,
presented as simply as possible, are included in this new 
edition. The whole of the original text has been revised and 
some small arithmetical errors corrected. I am grateful to 
those who have pointed out these mistakes. 

The original numbering of the chapters and sections has 
been preserved as far as possible. The chief additional material 
is as follows. An account of the nature and use of the binomial 
distribution is given in Chapter IV. The calculation of t from
arbitrary origins, of the variance ratio and of the standard
error of the difference between proportions is included in
Chapter V. In Chapter VII the Kendall coefficient of rank
correlation is introduced and in Chapter X a method of fitting
linear and logarithmic curves to observational data is
explained. Chapters X and XI are entirely new, the latter giving
a short introduction to the method of analysis of variance.
There are also some hitherto unpublished tables in the
Appendices F-K. Exercises on the new material are provided,
with answers. I am especially grateful to Mr J. W. Whitfield
and Miss V. R. Cane for helpful comments and criticisms.

E.G.C.


Cambridge 
June 1950 


Preface to First Edition 


The purpose of this book is to explain as simply as possible 
how to perform the calculations involved in the commoner 
statistical methods. It is not in any sense a treatise on the 
theory of statistics, only sufficient theory being given to 
enable the student to understand the use and application of 
the methods described. 

No assumption of mathematical ability on the part of the 
reader is made; the calculations described involve the use of
arithmetic only. A worked example of each method given is 
provided and abundant exercises with answers are supplied. 

Whilst the book is chiefly addressed to students of the bio- 
logical sciences, especially Psychology, the methods described 
are fundamental to statistical work and should, it is hoped, 
prove useful to anyone who has to make use of elementary 
statistical methods.

I should like to express my sincere gratitude to Dr J. O.
Irwin, who very kindly read the whole of the manuscript and
made many extremely helpful criticisms, and to Dr J. Wishart, 
who made some very valuable suggestions when the manu- 
script was approaching its final form. I wish to acknowledge 
also the kindness of Professor R. A. Fisher and his publishers, 
Messrs Oliver and Boyd, in allowing me to print extracts from 
various statistical tables given in his book Statistical Methods 
for Research Workers, and in the statistical tables by Fisher 
and Yates, also published by Oliver and Boyd. Full references 
to these two works are made in the text. 


E. G. CHAMBERS 


Cambridge 
August 1940 
Chapter I
INTRODUCTION 


1.i. Description of certain statistical terms. The 
statistical methods described in this book are all concerned 
with the treatment of variables. By a variable is meant a 
quantity which assumes different values that may be measured 
in some appropriate unit. Height, weight, test scores, readings 
on a thermometer, etc., are examples of variables, or variates,
as they are sometimes called. Variables are usually denoted by 
X or Y in this book. The number of times a particular value of 
a variable occurs in a set of observations is called the frequency 
of occurrence of that value, and a table showing the frequency 
of occurrence of all the values of a variable in a set of observa- 
tions is named a frequency distribution table. 

A series of observations may be represented by one value 
which is called an average, and the way in which the different 
values of the variable lie about this average is described as the 
scatter or dispersion of the observations. Measures of averages 
and scatter are descriptive statistics, since they yield in a con- 
densed form a description of a whole series of observations. 
This is the first function of statistical method: the other chief 
function is the examination of various hypotheses which are 
made about observational data. 

It is usually impossible to measure all the values of any 
variable, so that the data from a single experiment are only a 
sample drawn from the total population of possible
observations. For example, if the variable is human height, then the
total population of that variable would be the height of every 
man, woman and child ever on earth; it is manifestly impos- 
sible to measure all these values and in practice we have to 
content ourselves with measuring the heights of a sample of 
some convenient size. The distribution of the total population 
can usually be expressed in a mathematical form by using a
small number of constants or parameters. Obviously we can
never know the exact values of these parameters, since we
cannot measure the whole population, but we can make 
estimates of them by measuring random samples, that is,
samples drawn purely at random from the population. These 
estimates are known as statistics, and their accuracy as esti- 
mates depends on the size of the sample and the type of 
distribution of the variable. 

In calculating some statistics it is essential to know the 
number of degrees of freedom available for the calculation. The 
conception of ‘degrees of freedom’ is not an easy one for the 
beginner, so that whenever the term is used in this book cate- 
gorical rules for determining the number of degrees of freedom 
are supplied. Consideration of the following example may give 
some idea of the meaning of the term. Suppose 100 shillings 
are to be shared amongst 10 boys. We may give as many as we 
like (up to a total of 99) to each of 9 of the boys, but we are 
bound to give the tenth boy what is left over, i.e. we have only 
9 degrees of freedom in sharing out the shillings. If we were 
told further that 60 shillings were to be shared amongst the 
oldest 5 boys and the remainder amongst the youngest 5, we 
should only have 8 degrees of freedom for doing this, since the 
fifth boy in each group would have to have what was left over;
we should not be free to vary his share.

Two variables which are related together, so that a
knowledge of the values of one variable indicates likely values of
the other, are said to be associated or correlated. If the variables 
are unrelated, they are said to be independent. 

Other terms, applicable to particular methods, will be 
described in their appropriate places.


1.ii. Notation. A certain amount of symbolism is
essential in the description of statistical methods. Unfortunately
there is a lack of agreement amongst different authors, which 
is apt to be confusing to the beginner. For this reason an 
attempt at a consistent method of notation is made in this 
book. Being based on first principles it is hoped that it will 
be readily understood by the learner and will enable him to
follow the notation used in the standard books on statistical
methods. Symbols in general established use are taken over
unchanged. Here again there is confusion, since owing to the 
multiplicity of statistics the same symbol may have to stand 
for quite different quantities. Thus the symbol z is used 
for three different quantities and particular care is needed 
on the part of the student to avoid misconception in such 
cases. 

Certain symbols have the same significance throughout the 
whole of this book. For example, N always stands for the total
number of observations in a sample and S always signifies 
‘the sum of’. The other notation is made as unambiguous as 
possible. Certain Greek letters are used as symbols: a list of 
these with their pronunciation is given below:

α (alpha)
β (beta)      π (pi)
γ (gamma)     ρ (rho)
δ (delta)     σ (sigma)
ζ (zeta)      τ (tau)
η (eta)       φ (phi)
μ (mu)        χ (ki)
ν (nu)

Σ (capital sigma) is used in some places to indicate ‘the sum
of’. As a rule S is used for summation of sample values and Σ
for derived quantities.
Care must be taken by the student to avoid confusing 
suffixes and indices. Suffixes are small numbers or letters 
written after a symbol at the foot, e.g. $x_1$, $\sigma_x$, etc.; these are
merely descriptive and confine the use of the symbol to a par- 
ticular purpose. Indices are small numbers written after and 
above symbols and have their usual algebraical significance; 
for example, x² (x squared) means x multiplied by x, y³ (y
cubed) means y multiplied by y multiplied by y, and so on.
The usual arithmetical symbols, +, −, × and ÷, have their
accustomed significance. There are three other symbols with 
which the non-mathematical student may not be familiar. 
Vertical lines drawn on each side of a quantity mean ‘the 
positive numerical value of’, e.g. |a − b| means ‘the positive
numerical value of the difference between a and b’. Using this
notation, therefore, it does not matter whether we write |a − b|
or |b − a|. Secondly, there is the factorial sign, ‘!’. This latter
is best explained by examples, e.g. 4! stands for 4 × 3 × 2 × 1,
6! for 6 × 5 × 4 × 3 × 2 × 1, and so on. Thirdly, $\binom{n}{p}$ means the
number of combinations of n things taken p at a time. Expanded
algebraically,

$$\binom{n}{p} = \frac{n!}{p!\,(n-p)!}.$$

Since a good deal of arithmetical work is involved in certain 
of the statistical methods described in the following chapters, 


it is an advantage for the student to be familiar with the use of
logarithms (unless he has a calculating machine available). 
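
The three less familiar symbols described in this section (the vertical lines, the factorial sign and the combination bracket) can be tried out numerically. The following is only a minimal sketch in Python; the function name `combinations` and the sample values of a, b, n and p are illustrative assumptions and are not taken from the text.

```python
import math

# |a - b| is the same whichever way round the difference is written.
a, b = 3, 8  # hypothetical values chosen for illustration
abs_diff = abs(a - b)
abs_diff_reversed = abs(b - a)

# Factorials: 4! = 4 x 3 x 2 x 1 and 6! = 6 x 5 x 4 x 3 x 2 x 1.
four_factorial = math.factorial(4)
six_factorial = math.factorial(6)


def combinations(n: int, p: int) -> int:
    """Number of combinations of n things taken p at a time: n! / (p! (n - p)!)."""
    return math.factorial(n) // (math.factorial(p) * math.factorial(n - p))


# Hypothetical illustration: choosing 2 things out of 5.
five_choose_two = combinations(5, 2)
```

Python 3.8 and later also provide `math.comb(n, p)`, which returns the same quantity directly.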


1.iii. References. Reference is made in the following 
pages to two invaluable books on statistics and to certain 
books of statistical tables. The references are made numerically
to the following works: 


(1) An Introduction to the Theory of Statistics. G. Udny Yule 
and M. G. Kendall. 11th edition, Charles Griffin and Co. 
Ltd. 1937. 

(2) Statistical Methods for Research Workers. R. A. Fisher. 
9th edition. Oliver and Boyd. 1944.

(3) Tables for Statisticians and Biometricians, Part I. Edited 
by Karl Pearson. Biometrika Office, University College, 
London. 

(4) Barlow's Tables of Squares, Cubes, Square-roots, Cube-roots 
and Reciprocals of all Integral Numbers up to 12,500.
E. and F. N. Spon. 4th edition. 1941.

(5) Statistical Tables for Biological, Agricultural and Medical
Research. Fisher and Yates. Oliver and Boyd. 2nd edition,
1942.

(6) The Advanced Theory of Statistics. M. G. Kendall. Charles
Griffin and Co. Ltd. Vol. I, 1945; vol. II, 1946.


Since no attempt is made in this book to prove or justify the
various methods and formulae used, the student wishing to go 
into such matters is referred to the first two and the last of 
the foregoing works. 
1.iv. Use and abuse of statistical methods. The
student who works conscientiously through the following 
chapters should learn how to make use of the commoner 
methods of statistics. He should never forget, however, that 
statistical methods are merely tools for a research worker. 
They enable him to describe, relate and assess the value of his 
observations. They cannot make amends for incorrect
observation nor can they of themselves provide a single fact of
psychology, biology or any other subject of research.
Statistical methods are to the research worker what tools are to a
carpenter. The latter has first to learn how to use his tools and 
he may then by employing them reveal the useful and beauti- 
ful purposes to which his material may be put. But the tools 
themselves must be used for their correct functions. The 
craftsman will not, for instance, use a mallet and chisel or a
fretsaw to plane a plank of wood, nor will he use a hammer to 
drive in a screw. In the same way statistical methods must 
only be used by the research worker for the purposes for which 
they have been devised. 

Further, a carpenter’s tools cannot tell him directly
anything about the materials he is using. They cannot by
themselves distinguish between mahogany and deal nor prove that
oak is more durable than white wood. No carpenter’s tools
have ever yet made a piece of wood; similarly no statistical
method has ever yet produced a biological fact. 

The student is advised, therefore, to try to acquire an under- 
standing of the specific purpose of each statistical method he 
learns to use, to appreciate the scope of and the assumptions 
underlying the use of each formula, and to realise that the
outcome of each calculation is a statistical statement which has
to be interpreted in terms of the particular branch of science 
from which the data for examination are drawn. 


Chapter II 
AVERAGES 


2.i. The arithmetic mean. The best known and most 
useful form of average is the arithmetic mean, usually referred 
to as the ‘mean’ or the ‘average’. It is easily calculated by 
adding together all the observations to be averaged and 
dividing the sum or total by the number of observations.


Example 1. Find the mean of the following observations: 
22, 24, 20, 23, 21, 19, 23, 22, 20, 22, 20, 22, 23, 25, 21, 21, 22, 
24, 23, 22, 23, 21, 22, 21, 23. 

Add together all the observations. 

The sum = 549.
The number of observations = 25.

$$\text{The arithmetic mean} = \frac{\text{sum of observations}}{\text{no. of observations}} = \frac{549}{25} = 21.96.$$

This procedure may be generalised to cover all cases. If X 
is a variable which has different values $X_1$, $X_2$, $X_3$, etc., then
the arithmetic mean of a number N of such values is the sum
of the various values of X, which we denote by S(X), divided
by N, the number of them. In general, therefore,

$$m_x = \bar{X} = \frac{S(X)}{N}. \qquad (1)$$

Here $m_x$ and $\bar{X}$ (called X-bar) are different ways of denoting
‘the mean of X’.
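
Formula (1) can be checked on the observations of Example 1. The following is a minimal sketch in Python; the names `observations` and `arithmetic_mean` are chosen for illustration only.

```python
# The 25 observations of Example 1.
observations = [22, 24, 20, 23, 21, 19, 23, 22, 20, 22, 20, 22, 23,
                25, 21, 21, 22, 24, 23, 22, 23, 21, 22, 21, 23]


def arithmetic_mean(values):
    """Formula (1): the sum S(X) divided by the number of observations N."""
    return sum(values) / len(values)


total = sum(observations)             # S(X)
n = len(observations)                 # N
mean = arithmetic_mean(observations)  # S(X) / N
```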


2.ii. If N is large and no adding machine is available, the
process of addition may be very laborious. It may, however,
be made easier by the construction of a frequency distribution
table. In the sample under consideration in Example 1, the
values taken by X all lie between 19 and 25
inclusive. If we count how many times each different value of
X occurs and write the totals in tabular form, we obtain the 
frequency distribution table given below. 


TABLE I

 X      f
19      1
20      3
21      5
22      7
23      6
24      2
25      1
       --
       25
In this table the first column, headed X, shows the different 
values assumed by the variable X in the sample, and the second
column, headed f, gives the number of times, or frequency, of
occurrence of each. The total of the f column is, of course, the
total number of observations we are averaging, i.e. S(f) = N.
The next step is to write down a third column, headed fX,
which is produced by multiplying together the corresponding
pairs of numbers in the X and f columns. We then sum the
fX column, giving us Σ(fX), and the arithmetic mean is then
obtained by dividing this sum by N as before; i.e.

$$m_x = \bar{X} = \frac{\Sigma(fX)}{N}. \qquad (2)$$

Example 2. Calculate the mean of the observations in
Example 1 by constructing a frequency distribution table.

 X      f     fX
19      1     19
20      3     60
21      5    105
22      7    154
23      6    138
24      2     48
25      1     25
       --    ---
       25    549

Σ(fX) = 549,
N = 25,

$$\bar{X} = \frac{549}{25} = 21.96.$$

It will be noted that this result is identical with that
obtained in Example 1.
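
The construction of the frequency distribution table and the use of formula (2) can be sketched as follows. This is an illustrative Python sketch only; `Counter` from the standard library stands in for the counting done by hand, and the variable names are assumptions.

```python
from collections import Counter

# The 25 observations of Example 1.
observations = [22, 24, 20, 23, 21, 19, 23, 22, 20, 22, 20, 22, 23,
                25, 21, 21, 22, 24, 23, 22, 23, 21, 22, 21, 23]

# Count how many times each value of X occurs; each row is (X, f).
table = sorted(Counter(observations).items())

n = sum(f for _, f in table)            # S(f) = N
sum_fx = sum(x * f for x, f in table)   # total of the fX column
mean = sum_fx / n                       # formula (2)
```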
2.iii. The method of Section 2.ii is useful when the range
of the X values is small, but if there are many different values 
of X the method again becomes laborious. Suppose the obser- 
vations in Example 1 were the lengths of 25 sticks measured 
in centimetres, each one being measured to the nearest centi- 
metre. Now let us suppose we had a large number of such 
sticks and measured them in millimetres. The range of the
measurements might now be from 190 to 252 mm., so that if 
we constructed a frequency distribution table of these we 
should have 63 different values of X to tabulate. This would 


TABLE II

   X         f
190-193      2
194-197      4
198-201      7
202-205     12
206-209     19
210-213     24
214-217     27
218-221     35
222-225     26
226-229     21
230-233     18
234-237     13
238-241      6
242-245      5
246-249      2
250-253      1


be tedious, and the calculation may be shortened, with some 
small sacrifice of accuracy, by subdividing the range of the 
X’s into a convenient number of groups. In practice, a number 
of groups between 12 and 20 should be chosen, and the best 
unit for grouping may be found by dividing the range first by 
12 and then by 20, and taking a convenient unit in between 
these two quotients. For instance, if the range is 63, then the 
results of dividing 63 first by 12 and then by 20 are 5.25 and
3.15. Hence a convenient working unit for grouping would be
4 in this case. This means that we should group the values of
X together in 4’s, so that the first group would comprise 190,
191, 192 and 193, the second 194, 195, 196 and 197, and so on, 
the last group being 250, 251, 252 and 253. We could now con- 
struct a frequency distribution table of these groups. Such a 
table might be, for example, as Table II. This gives the fre- 
quency distribution of the lengths of 222 sticks measured in 
4 mm. groups.
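
The rule for choosing a working unit, and the resulting group limits, can be sketched as follows. This is a rough Python illustration under stated assumptions: the function names are hypothetical, and the final choice of a "convenient" unit between the two quotients is still left to the reader's judgment.

```python
def grouping_unit_bounds(value_range):
    """Return (range / 20, range / 12); a convenient unit lies between the two."""
    return value_range / 20, value_range / 12


def group_limits(first, last, unit):
    """Inclusive (low, high) limits of consecutive groups of `unit` whole values."""
    return [(low, low + unit - 1) for low in range(first, last + 1, unit)]


# The range of 63 values (190 to 252 mm.) discussed in the text.
low_quotient, high_quotient = grouping_unit_bounds(63)

# A working unit of 4 lies between the two quotients.
groups = group_limits(190, 252, 4)
```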

Now the method of calculating the mean length of the sticks 
from such a table depends on the assumption that the average 
length of the sticks in each group is equal to the mean value of 
X for that group. For instance, there are 12 sticks in the 202- 
205 group and we shall assume that the average length of those 
12 sticks is equal to the average of 202, 203, 204 and 205, i.e. 
203.5. This is, of course, an assumption, but the larger the
number of readings in each group the nearer it becomes to
being true.

Care should be taken in arranging the grouping that this
assumption should be as nearly as possible true for the end
groups, leaving the middle ones, which usually have more
readings in them, to look after themselves. For instance, if
the two readings in the 190-193 group were both 190, then
their average would be 190 instead of 191.5, which is assumed
by the grouping. In such a case it would have been better to
have started the grouping from 189, which would assume an
average for the group of 190.5.

In order to calculate the mean of the observations in a 
grouped frequency distribution table, such as Table II, we 
take an arbitrary origin, or starting point, and then calculate
the discrepancy between this point and the true mean. Let us
take our arbitrary origin near the middle of the range, since
this simplifies the arithmetic. It is convenient to have it at the
centre of a group, so we will choose the 218-221 group, so that
our arbitrary origin will be 219.5, the centre of the group. This
group we number 0. The next group in the table, 222-225, has
an average of 223.5, which is one group unit, or working unit,
above the arbitrary origin, and we therefore number this
group 1. In a similar manner the 226-229 group is numbered 2,
and so on. The 214-217 group averages 215.5, which is one working
unit less than the arbitrary origin, so this group we number
−1. Similarly the 210-213 group becomes −2, and so on.
Example 3. Calculate the mean of the data in Table II.

   X         f      z      fz
190-193      2     −7     −14
194-197      4     −6     −24
198-201      7     −5     −35
202-205     12     −4     −48
206-209     19     −3     −57
210-213     24     −2     −48
214-217     27     −1     −27
218-221     35      0    −253*
222-225     26      1      26
226-229     21      2      42
230-233     18      3      54
234-237     13      4      52
238-241      6      5      30
242-245      5      6      30
246-249      2      7      14
250-253      1      8       8
           ---            ---
           222            256
                         −253
                          ---
                            3

Σ(fz) = 3,
N = 222,
D = 3/222 = 0.0135,
arbitrary origin = 219.5, w = 4,
mean = 219.5 + 0.0135 × 4
     = 219.5 + 0.054
     = 219.55.

* Since there will be no entry in the fz column corresponding to
z = 0, this is a convenient place to add the negative entries in the fz
column.

We can now number the groups in a further column, which
we will head z. This indicates the number of working units by
which each X-group lies above or below the arbitrary origin. A
fourth column, headed fz, is
obtained by multiplying corresponding entries in the f and z
columns. By adding this column we get Σ(fz), and hence

$$D = \frac{\Sigma(fz)}{N}. \qquad (3)$$

This quantity D tells us how many working units the true
mean $m_x$ lies away from the arbitrary origin, which we will
call $m_a$. Hence the true mean is given by the formula

$$m_x = \bar{X} = m_a + Dw, \qquad (4)$$

where w is the size of the working unit. If D turns out to be
negative, then Dw will have to be subtracted from $m_a$. If it
should happen that the arbitrary origin is the true mean, as
might happen with a perfectly symmetrical distribution, then
D will, of course, be zero.

The whole process of calculating the mean by this method is 
shown in Example 3. 

We conclude therefore that the mean length of the sticks 
was 220 mm., to the nearest millimetre.
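
The arbitrary-origin calculation for the data of Table II can be followed step by step in the sketch below. It is a minimal Python illustration; the variable names (`arbitrary_origin`, `sum_fz` and so on) are assumptions made for this sketch and not the book's notation.

```python
# Lower limit of each 4 mm. group in Table II, and the frequencies f.
group_lows = list(range(190, 251, 4))
freqs = [2, 4, 7, 12, 19, 24, 27, 35, 26, 21, 18, 13, 6, 5, 2, 1]

w = 4                                    # size of the working unit
origin_index = group_lows.index(218)     # the 218-221 group is numbered 0
arbitrary_origin = 218 + (w - 1) / 2     # centre of that group, 219.5

# z: number of working units each group lies from the arbitrary origin.
z = [i - origin_index for i in range(len(group_lows))]

n = sum(freqs)                                     # N
sum_fz = sum(f * zi for f, zi in zip(freqs, z))    # total of the fz column
d = sum_fz / n                                     # D = sum of fz over N
mean = arbitrary_origin + d * w                    # origin plus D times w
```

Rounded to two decimal places, `mean` agrees with the 219.55 found in Example 3.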


2.iv. The median. Another form of average which is 
sometimes used for convenience is the median. This, as its 
name implies, is the middle observation and it is easily found 
in ungrouped data by ranking the sample of observations in 
order of their size and finding the central observation, if N
is odd, or the mean of the two observations in the middle, if
N is even. If N is odd, the median will be the (N + 1)/2th
observation. If N is even, it will be the mean of the N/2th and
the (N/2 + 1)th observations. For example, the median of 2, 4, 6,
8, 10, 12 is the mean of the 3rd and 4th readings, i.e.

(6 + 8)/2 = 7.

The median is representative of a set of observations in the 
sense that there are exactly as many observations greater than
it as there are less. If the distribution is perfectly symmetrical,
the median is equal to the mean. 

Example 4. Find the median of the observations in Ex- 
ample 1. 

Ranking the observations in order of size we get: 19, 20, 20,
20, 21, 21, 21, 21, 21, 22, 22, 22, 22, 22, 22, 22, 23, 23, 23, 23, 
23, 23, 24, 24, 25. 

Since there are 25 observations, the median will be the 13th, 
i.e. 22.
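
The rule for the median of ungrouped data can be sketched as below. This is an illustrative Python sketch; the function name `median` is chosen here for convenience (the standard library's `statistics.median` follows the same rule).

```python
def median(values):
    """Middle observation if N is odd, else the mean of the two middle ones."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # The (N + 1)/2th observation, counting from 1.
        return ordered[middle]
    # The mean of the N/2th and (N/2 + 1)th observations.
    return (ordered[middle - 1] + ordered[middle]) / 2


# The 25 observations of Example 1.
observations = [22, 24, 20, 23, 21, 19, 23, 22, 20, 22, 20, 22, 23,
                25, 21, 21, 22, 24, 23, 22, 23, 21, 22, 21, 23]

median_even = median([2, 4, 6, 8, 10, 12])
median_example_4 = median(observations)
```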

2.v. Finding the median for grouped data involves a little 
approximation, as was the case in finding the mean of grouped
data. We assume that the values in each group are evenly
distributed through the group and the approximate value of 
the median may be found by linear interpolation. Suppose, for 
example, we wished to find the median of the data in Table II. 
We have to find the value of the observation below which half 
the observations lie. N in this case is 222, so that ½N is 111.
The median therefore will be the position of the mean of the 
111th and 112th observations. By adding together the first 
seven entries in the f column we find that 95 of the observations 
lie below 217.5,* and there are 35 observations in the 218-221
group. Obviously, therefore, the median lies somewhere in this 
group. The position may be found by simple proportion. 
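111 − 95 = 16. In the 218-221 group 16 observations will
be less than the median, assuming that
the observations are evenly distributed. The group
contains 35 observations and is 4 mm. wide, so that the median lies
16/35 of the group width above the upper limit of the previous group.
Hence the median of the data in Table II is

$$217.5 + \frac{16}{35} \times 4 = 217.5 + 1.83 = 219.33.$$

This may be checked by working from the other end of the
table. We find that 92 observations lie above 221.5, so that the
median is

$$221.5 - \frac{111 - 92}{35} \times 4 = 221.5 - 2.17 = 219.33.$$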
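
The linear interpolation used for the median of grouped data can be sketched as follows. This is a minimal Python sketch, assuming equal-width groups whose real lower limits lie half a unit below the tabulated ones (189.5, 193.5, and so on); the function and variable names are illustrative.

```python
def grouped_median(lower_limits, freqs, width):
    """Interpolate the median, assuming values are spread evenly in each group."""
    half = sum(freqs) / 2          # half of N
    below = 0                      # observations below the current group
    for lower, f in zip(lower_limits, freqs):
        if f > 0 and below + f >= half:
            # The median lies (half - below) / f of the way through this group.
            return lower + (half - below) / f * width
        below += f
    raise ValueError("frequency table is empty")


# Table II: real lower limits of the 4 mm. groups, and their frequencies.
lower_limits = [189.5 + 4 * i for i in range(16)]
freqs = [2, 4, 7, 12, 19, 24, 27, 35, 26, 21, 18, 13, 6, 5, 2, 1]

median_table_2 = grouped_median(lower_limits, freqs, 4)
```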


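2.vi. The mode. The mode is that value of the variable at
which the observation occurs more
often than any other. However, the mode cannot usually be
determined accurately from samples, since errors of
sampling may cause the more frequent occurrence of some
observation remote from the true mode.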


* Note that since the measurements in the X column are made to
the nearest millimetre, any observation up to but not equal to 217.5
will count as 217 or under, so that the real upper limit of the group is
217.5. Similarly the real lower limit of the 222-225 group is 221.5.
If a graph of the frequency distribution of any variable for
the total population were drawn, we should have a smooth
curve (see Fig. 2, p. 25 for an example). In such a curve the
position of the mode would be given by the highest point, since
there would be the maximum frequency at that point. Such
a curve might be symmetrical or it might be asymmetrical, or
skew. It has been found that in curves of moderate skewness
the position of the mode is given approximately by the relation

Mode = mean − 3 (mean − median).
It is better, therefore, to make use of this equation for finding 
the mode of a sample than to rely on picking out the most 
frequently occurring observation. (We are considering here 
only distributions which give a hump-backed curve and have 
therefore only one mode. Certain distributions may have two 
or more modes, but the elementary student need not concern 


himself with these.) 


Example 5. Find the mode of the data in Table II.

We have found previously that the mean of the data is
219.55 and the median is 219.33. Substituting these values in
the equation

Mode = mean − 3 (mean − median),

we have

Mode = 219.55 − 3 (219.55 − 219.33)
     = 219.55 − 0.66
     = 218.89.

2.vii. It may be seen from the above equation that if the
mean and median are equal the mode is also equal to the mean.
Thus in the case of a variable which is distributed in a perfectly
symmetrical manner, the mean, the median and the mode are
all equal.
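
The approximate relation between mode, mean and median can be expressed in a few lines. The sketch below is illustrative Python only; the function name `estimated_mode` is an assumption, and the relation itself holds only approximately, for single-humped curves of moderate skewness.

```python
def estimated_mode(mean, median):
    """Mode = mean - 3 (mean - median), for moderately skew distributions."""
    return mean - 3 * (mean - median)


# Example 5: mean 219.55 and median 219.33 for the data of Table II.
mode_table_2 = estimated_mode(219.55, 219.33)

# When mean and median are equal, the estimated mode equals them too.
mode_symmetrical = estimated_mode(22.0, 22.0)
```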


Note on the construction of a frequency distribution table 
If the observations to be tabulated are on cards, the fre- 
quency distribution table is easily formed by sorting the cards 
into their appropriate groups and counting the number of 
cards in each group. When, however, the data are not on cards, 
a frequency table may be made by the use of a spot diagram.
This is made by tabulating the X column and then going
through the observations one at a time and putting a spot
against the appropriate X-group for each observation. The
data in Table II would appear as under in a spot diagram.


[Spot diagram of the data in Table II: the X-groups from 190-193 to 250-253 are listed in a column, with one spot entered against the appropriate group for each observation.]

It helps in counting the spots for the entries in the f column if they
are put down in groups of five, as above.


1. Find the arithm, 


Use formula (1). 


2. Find the arithmetic mean of the scores in test P for
(a) subjects 1-50,
(b) subjects 51-100.
Use formula (2).

3. Construct grouped frequency distribution tables for the scores
in each of the tests A to G inclusive for the whole 100 subjects. Use
the following group units and starting points.


Test    Group unit    First group
 A          4           14-17
 B         15          122-136
 C          4            0-3
 D          4            4-7
 E          6           78-83
 F          2            8, 9
 G          3           17-19


(a) Thence, using formulae (3) and (4), calculate the mean scores
in each of the tests A to G inclusive for the whole 100 subjects.

(b) Compare the means obtained with those given by adding the
means of the four groups of 25 subjects found in Exercise 1 and
dividing the totals by 4. These latter means are accurate and will
indicate the discrepancies introduced by grouping the data.


4. Find the median of the scores in each of the tests from A to G
inclusive for
(a) subjects 1-50,
(b) subjects 51-100.

5. From the answers to Exercise 1 compute the means of the scores
of each test from A to G inclusive for
(a) subjects 1-50,
(b) subjects 51-100.
Then, making use of the answers to Exercise 4, estimate the mode of
the scores in each test for
(a) subjects 1-50,
(b) subjects 51-100.
Use the relationship ‘Mode = mean − 3 (mean − median)’.


Chapter III 
SCATTER OR DISPERSION 


3.i. Whilst an average to some extent represents the whole
series of observations of which it is the mean, yet it does not by
itself convey sufficient information about those observations.
As a general rule it is also necessary to know how the
observations are scattered around their average. Obviously the
average of a given number of observations which all lie closely
together represents them better than the average of observations
which are widely scattered. It is necessary, therefore, to give an
indication of the amount of scatter of the observations averaged.


3.ii. The range. The simplest measure of scatter or
dispersion is the range of the observations, i.e. the distance
between the largest and smallest observations. The range,
however, is not a good measure of dispersion. It is based on two
observations only, instead of on all the available information, 
and those two observations are liable to vary considerably in 
different samples, since the range depends essentially on the 
size of the sample. 

Some writers give the range which includes all the
observations except the 10% smallest and the 10% largest in an
attempt to allow for the variability of the extremes of a
sample, but this practice has little to recommend it.

3.iii. The inter-quartile range. A measure of dispersion
which is not quite so subject to fluctuations is the
inter-quartile range. This is obtained as follows. First rank
the observations in order of size and find the median (see
Section 2.iv). This divides the sample into two halves.
Then find the median of each half: that of the lower half is
called the lower quartile, that of the upper half the upper
quartile. These two quartiles with the median divide the
range of observations into four equal inter-quartile groups,
and the distance between the lower and upper quartiles is
called the inter-quartile range. It is usual to halve this and
quote the ‘semi-inter-quartile range’.

This measure of dispersion makes use of slightly more
information than does the total range, but it does not lend itself
to further mathematical treatment and is not a very valuable
measure.
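
The quartile procedure can be sketched as below for the observations of Example 1. This is an illustrative Python sketch resting on one assumption the text does not settle: when N is odd, the middle observation is left out of both halves. Other conventions exist and give slightly different quartiles. The function name `quartiles` is hypothetical.

```python
from statistics import median


def quartiles(values):
    """Lower and upper quartiles as the medians of the lower and upper halves.

    Assumption: for odd N the central observation belongs to neither half.
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    if half == 0:
        raise ValueError("at least two observations are needed")
    lower_half = ordered[:half]
    upper_half = ordered[-half:]
    return median(lower_half), median(upper_half)


# The 25 observations of Example 1.
observations = [22, 24, 20, 23, 21, 19, 23, 22, 20, 22, 20, 22, 23,
                25, 21, 21, 22, 24, 23, 22, 23, 21, 22, 21, 23]

lower_quartile, upper_quartile = quartiles(observations)
inter_quartile_range = upper_quartile - lower_quartile
semi_inter_quartile_range = inter_quartile_range / 2
```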


3.iv. The mean deviation. A measure of scatter which 
makes use of all the observations is obtained by writing down 
the difference between each separate observation and the 
average, adding together all these differences without regard 
to their signs, and dividing the total by the number of obser- 
vations. This is called the mean deviation or mean variation and 
is best calculated from the median, as it is then a minimum. 


Example 6. Calculate the mean deviation of the data in
Example 1.

The median of the observations is 22, as was found in
Example 4. The differences between each reading and the
median, without regard to sign, are: 0, 2, 2, 1, 1, 3, 1, 0, 2, 0, 2,
0, 1, 3, 1, 1, 0, 2, 1, 0, 1, 1, 0, 1, 1.

The sum of these differences = 27,
N = 25.

The mean deviation = 27/25 = 1.08.
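
The mean deviation from the median can be computed directly, as in this minimal Python sketch; the function name `mean_deviation` and its `centre` parameter are illustrative assumptions.

```python
def mean_deviation(values, centre):
    """Mean of the differences from `centre`, taken without regard to sign."""
    return sum(abs(x - centre) for x in values) / len(values)


# The 25 observations of Example 1, whose median is 22 (Example 4).
observations = [22, 24, 20, 23, 21, 19, 23, 22, 20, 22, 20, 22, 23,
                25, 21, 21, 22, 24, 23, 22, 23, 21, 22, 21, 23]

sum_of_differences = sum(abs(x - 22) for x in observations)
deviation_from_median = mean_deviation(observations, 22)
```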


3.v. The standard deviation. By far the best and most 
useful measure of scatter is the standard deviation. In words, 
this is the square root of the mean of the squares of the
deviations of the observations from their arithmetic mean. In
symbols, if $(X - \bar{X})$ represents the deviation of an individual
reading from the mean and $S(X - \bar{X})^2$ the sum of the squares
of all such deviations, then the standard deviation, σ, is given
by the formula

$$\sigma = \sqrt{\frac{S(X - \bar{X})^2}{N}}. \qquad (5)$$

The square of the standard deviation is called the variance
or second moment, the latter usually being denoted by $\mu_2$.
Hence the variance

$$\sigma^2 = \mu_2 = \frac{S(X - \bar{X})^2}{N}. \qquad (5A)$$

The method of calculating the standard deviation depends
on the type of data and the number of observations, and
the following sections show how to calculate it in different
cases.


3.vi. When N, the number of observations, is small and 
the mean of the observations is a whole number, the standard 
deviation is easily calculated by subtracting the mean from 
each observation, squaring each of these differences, adding 
the squares, dividing the sum of the squares by N and finally
taking the square root of the quotient.*


Example 7. Find the standard deviation of the first 11
natural numbers.

     X     X - X̄    (X - X̄)²
     1      -5        25
     2      -4        16
     3      -3         9
     4      -2         4
     5      -1         1
     6       0         0
     7       1         1
     8       2         4
     9       3         9
    10       4        16
    11       5        25
    66               110

$S(X) = 66$, $N = 11$, $\bar{X} = 66/11 = 6$, $S(X - \bar{X})^2 = 110$.

$$\sigma = \sqrt{\frac{110}{11}} = \sqrt{10} = 3.16.$$
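As a check on Example 7, the following minimal Python sketch applies formula (5) directly, dividing the sum of squared deviations by N. The function name is an illustrative choice, not something taken from the text.

```python
import math


def standard_deviation(observations):
    """Formula (5): root of the mean squared deviation from the arithmetic mean."""
    n = len(observations)
    mean = sum(observations) / n
    sum_sq_dev = sum((x - mean) ** 2 for x in observations)
    return math.sqrt(sum_sq_dev / n)


numbers = list(range(1, 12))  # the first 11 natural numbers
sigma = standard_deviation(numbers)
print(round(sigma, 2))
```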


* When the variance in the whole population is being estimated
from a sample, the sum of squares is divided by N - 1 (see 5.ii, p. 37).

3.vii. It rarely happens in practice that the mean of a set of
observations is a whole number, and a formula which avoids the
individual deviations is then more convenient. If the
observations are $X_1$, $X_2$, $X_3$, etc., their
differences from the mean are $(X_1 - \bar{X})$, $(X_2 - \bar{X})$, $(X_3 - \bar{X})$, etc.
Squaring these we get

$$(X_1 - \bar{X})^2 = X_1^2 - 2X_1\bar{X} + \bar{X}^2,$$
$$(X_2 - \bar{X})^2 = X_2^2 - 2X_2\bar{X} + \bar{X}^2,$$
$$(X_3 - \bar{X})^2 = X_3^2 - 2X_3\bar{X} + \bar{X}^2, \text{ etc.}$$

Summing these we have

$$S(X - \bar{X})^2 = S(X^2) - 2\bar{X}\,S(X) + N\bar{X}^2.$$

Divide both sides by N. Then

$$\frac{S(X - \bar{X})^2}{N} = \frac{S(X^2)}{N} - 2\bar{X}\,\frac{S(X)}{N} + \bar{X}^2.$$

But $S(X)/N = \bar{X}$, by formula (1); hence

$$\sigma^2 = \frac{S(X^2)}{N} - \bar{X}^2. \qquad (5B)$$

In words, the variance is the mean of the squares of the
observations less the square of their mean. Hence the standard
deviation of the numbers in Example 7 could have been
calculated as under:


     X      X²
     1       1
     2       4
     3       9
     4      16
     5      25
     6      36
     7      49
     8      64
     9      81
    10     100
    11     121
    66     506

$S(X) = 66$, $N = 11$, $\bar{X} = 6$.

$$\text{Variance} = \frac{506}{11} - 6^2 = 46 - 36 = 10.$$

Hence $\sigma = \sqrt{10} = 3.16$.
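Formula (5B) lends itself to a very short program. The sketch below, with an illustrative function name, computes the variance as the mean of the squares less the square of the mean and reproduces the figures of the table above.

```python
def variance_shortcut(observations):
    """Formula (5B): mean of the squares less the square of the mean."""
    n = len(observations)
    mean = sum(observations) / n
    mean_of_squares = sum(x * x for x in observations) / n
    return mean_of_squares - mean ** 2


numbers = list(range(1, 12))  # the first 11 natural numbers
variance = variance_shortcut(numbers)
print(sum(x * x for x in numbers), variance)
```

With very large values of X this form can lose accuracy in floating-point arithmetic, since two nearly equal quantities are subtracted; for hand-sized examples such as this one the two methods agree.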


The method is more usefully extended to a frequency
distribution table, for which we have the following formula:

$$\sigma^2 = \frac{\Sigma(fX^2)}{N} - \bar{X}^2. \qquad (6)$$

This may also be written

$$\sigma^2 = \frac{\Sigma(fX^2)}{N} - \left[\frac{\Sigma(fX)}{N}\right]^2. \qquad (6A)$$

The use of this formula is illustrated in the following example.


Example 8. Find the standard deviation of the data in the
following frequency distribution table:

     X     f
     1     2
     2     6
     3    12
     4     7
     5     2
     6     1

Construct two further columns, as under:

     X     f     fX    fX²
     1     2      2      2
     2     6     12     24
     3    12     36    108
     4     7     28    112
     5     2     10     50
     6     1      6     36
          30     94    332

$N = S(f) = 30$, $\Sigma(fX) = 94$, $\Sigma(fX^2) = 332$.

$$\sigma^2 = \frac{332}{30} - \left(\frac{94}{30}\right)^2 = 11.0667 - 9.8178 = 1.2489,$$

$$\sigma = \sqrt{1.2489} = 1.117.$$
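The calculation in Example 8 can be sketched in Python as follows. The frequency table is held as a dictionary mapping each X to its frequency f; that representation and the function name are assumptions made for illustration.

```python
import math

# Frequency distribution of Example 8: value X -> frequency f.
table = {1: 2, 2: 6, 3: 12, 4: 7, 5: 2, 6: 1}


def frequency_sd(freq_table):
    """Formula (6A): standard deviation from a frequency distribution table."""
    n = sum(freq_table.values())
    sum_fx = sum(f * x for x, f in freq_table.items())
    sum_fx2 = sum(f * x * x for x, f in freq_table.items())
    variance = sum_fx2 / n - (sum_fx / n) ** 2
    return n, sum_fx, sum_fx2, math.sqrt(variance)


n, sum_fx, sum_fx2, sd = frequency_sd(table)
print(n, sum_fx, sum_fx2, round(sd, 4))
```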
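3.viii. This method becomes laborious when the numbers in the
X column are large, since X² is then very large. It is then
better to work in working units, and the formulae in this
case are:

$$\sigma_x^2 = \frac{\Sigma(fx^2)}{N} - \left[\frac{\Sigma(fx)}{N}\right]^2. \qquad (7)$$

Using formula (3), this may also be written

$$\sigma_x^2 = \frac{\Sigma(fx^2)}{N} - \bar{x}^2. \qquad (7A)$$

Also

$$\sigma = \sigma_x \times c. \qquad (8)$$

In these formulae $\sigma_x$ is the standard deviation in working
units and $c$ is the group interval. The method of calculation
is an extension of that used
in Example 3 and the whole process is shown below.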




Example 9. Calculate the standard deviation of the data in
Table II.

       X        f      x     fx     fx²
    190-193     2     -7    -14      98
    194-197     4     -6    -24     144
    198-201     7     -5    -35     175
    202-205    12     -4    -48     192
    206-209    19     -3    -57     171
    210-213    24     -2    -48      96
    214-217    27     -1    -27      27
    218-221    35      0   -253       0
    222-225    26      1     26      26
    226-229    21      2     42      84
    230-233    18      3     54     162
    234-237    13      4     52     208
    238-241     6      5     30     150
    242-245     5      6     30     180
    246-249     2      7     14      98
    250-253     1      8      8      64
              222           256    1875
                           -253
                              3

(The entry -253 in the fx column is the total of the negative
products, 256 is the total of the positive products and 3 is
their algebraic sum.)

The fx² column is obtained by multiplying together the
corresponding entries in the x and the fx columns.

$$\sigma_x^2 = \frac{1875}{222} - \left(\frac{3}{222}\right)^2 = 8.4457,$$

$$\sigma_x = \sqrt{8.4457} = 2.906,$$

$$\sigma = 2.906 \times 4 = 11.624.$$
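The working-unit method of Example 9 may be sketched in Python as below. The frequencies are those of Table II as used in the example; the working units run from -7 to 8 with the group 218-221 taken as the arbitrary origin, and the group interval of 4 is read off the class limits. The function name is illustrative.

```python
import math

# Frequencies of Table II, from group 190-193 up to group 250-253.
frequencies = [2, 4, 7, 12, 19, 24, 27, 35, 26, 21, 18, 13, 6, 5, 2, 1]
working_units = list(range(-7, 9))  # x = 0 for the group 218-221
group_interval = 4


def grouped_sd(freqs, units, interval):
    """Formulae (7) and (8): SD in working units, then converted back."""
    n = sum(freqs)
    sum_fx = sum(f * x for f, x in zip(freqs, units))
    sum_fx2 = sum(f * x * x for f, x in zip(freqs, units))
    sd_working = math.sqrt(sum_fx2 / n - (sum_fx / n) ** 2)
    return n, sum_fx, sum_fx2, sd_working, sd_working * interval


n, sum_fx, sum_fx2, sd_working, sd = grouped_sd(
    frequencies, working_units, group_interval)
print(n, sum_fx, sum_fx2, round(sd_working, 3), round(sd, 2))
```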
3.ix. The coefficient of variation. If we wish to compare
the scatter of different variables about their means, it is useful
to be able to express the scatter in some form which is not
dependent on the absolute size of the variables. For instance,
mice and men may be relatively equally variable in length, but
this would not be revealed by stating the standard deviation
in inches of a sample of each. Comparison may, however, be
usefully made by calculating the coefficient of variation in each
case. This is readily obtained by expressing the standard
deviation as a percentage ratio of the mean, i.e.

$$V = \frac{100\sigma}{m}, \qquad (9)$$

where m is the mean of the sample.


The coefficient V is a ratio and is therefore independent 


of the units in which the mean and standard deviation are 
measured. 


Example 10. Compare the relative variability of the data in
Examples 7 and 9.

In Example 7, $\bar{X} = 6$, $\sigma = 3.16$.

$$V = \frac{100 \times 3.16}{6} = 52.7.$$

In Example 9, $\bar{X} = 219.55$, $\sigma = 11.624$.

$$V = \frac{100 \times 11.624}{219.55} = 5.29.$$

Hence the data in Example 7 are about 10 times as scattered
relatively to their mean as are those in Example 9.
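Formula (9) is a one-line calculation; the following sketch, with an illustrative function name, reproduces the two coefficients of Example 10 from the means and standard deviations quoted there.

```python
def coefficient_of_variation(mean, sd):
    """Formula (9): the standard deviation as a percentage of the mean."""
    return 100 * sd / mean


v_example_7 = coefficient_of_variation(6, 3.16)
v_example_9 = coefficient_of_variation(219.55, 11.624)
print(round(v_example_7, 1), round(v_example_9, 2),
      round(v_example_7 / v_example_9, 1))
```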


EXERCISES ON CHAPTER III 


6. Find the upper and lower quartiles of the scores in each test
from A to G inclusive for the whole 100 subjects and thence derive the
semi-inter-quartile range of each test.

7. Calculate the mean deviation of the scores in each of the tests
from A to G, using the medians given in the answers to Exercise 4, for
(a) subjects 1-50,
(b) subjects 51-100.

8. Using formula (5), calculate the standard deviation of the
scores in
(a) test E, subjects 1-25,
(b) test D, subjects 76-100.

9. Calculate the standard deviation of the following test scores,
using formula (6); tests A, C, D, E and G:
(a) subjects 1-25,
(b) subjects 26-50,
(c) subjects 51-75,
(d) subjects 76-100.

10. Using the grouped frequency distribution tables constructed in
Exercise 3, calculate the standard deviation of the scores of each test
from A to G for the whole 100 subjects. Use formulae (7A) and (8).
From these standard deviations and the means given in the answers
to Exercise 3 (a), calculate the coefficient of variation of each test
score. Use formula (9).


Chapter IV 


THE NORMAL AND BINOMIAL 
DISTRIBUTIONS 


applied unless this assumption is justified. Biological variables
are often in point of fact so distributed, but it is not safe to
assume that any particular distribution is normal without
examination.


Fig. 1. Histogram of frequency distribution. (Vertical axis:
frequency scale; horizontal axis: X groups, from 190-193 to 250-253.)


The meaning of a normal distribution may be understood
by considering ...

... these lines is indicated by the frequency of a particular
X-group and the frequency scale on the left. Finally we join these
vertical lines together by horizontal lines and we have what is
called a histogram of the frequency distribution of Table II.
This is illustrated in Fig. 1.

This figure gives a picture of the distribution of the variable 
X and it shows that most of the observations are clustered 
about the middle of the range and that there are relatively few 
observations at the extremes. The total frequency of the 
observations is given by the area between the base-line, the top 
of the histogram and the vertical lines at the boundaries if the 
interval between the groups is 1 unit. It will be seen that the
top of the histogram is rather irregular. If, however, a very
large number of observations had been made and the points
along the base-line had been much more numerous, then the
boundary of the histogram would have become a smooth,
continuous line, in shape something like the curve shown in Fig. 2.

Fig. 2. The normal distribution.

Fig. 2 shows the shape of a normal distribution, sometimes
called a Gaussian distribution. This curve has various
important properties, some of which must be mentioned.


4.ii. The normal distribution has an exact mathematical
formula. It is a continuous curve and applies to continuous
variables, such as height, where the difference between one
value of the variable and the next can be indefinitely small.
Mathematically the curve stretches to infinity in both
directions, but practically only the portion drawn above is of
importance.

The mean of the observations is in the exact centre of the 
curve and there is the greatest number of observations at this 
point. Since the curve is symmetrical, the median and the 
mode coincide with the mean. 

The area between the curve and two uprights drawn at any 
points gives the fraction of the total number of observations 
between those points. In Fig. 3 uprights have been drawn at 
points corresponding to 1, 2 and 3 times the standard deviation 
of the distribution on each side of the mean. 


Fig. 3 


The area between $-\sigma$ and $+\sigma$ is 68% of the total area. This
means that in a normal distribution 68% of the observations
lie within a distance equal to the standard deviation on each
side of the mean. Similarly, from $-2\sigma$ to $+2\sigma$ includes 95%
and from $-3\sigma$ to $+3\sigma$ includes 99.7% of the observations.
Hence it is obvious that in a normal distribution practically
all the observations lie within a range of 6 times the standard
deviation. This provides a rough check on the size of a
calculated standard deviation: if the number of observations is
large, the standard deviation should be approximately a sixth
of the range. (Note that in Example 9 the standard deviation
was 11.6 and the range was 63, nearly 6 times as great.)

Exactly half the observations are included in an area
bounded by a distance of $0.67449\sigma$ on each side of the mean,
shown by dotted lines in Fig. 3. This means that the chances
are exactly equal that a single observation shall deviate from
the mean by an amount greater or less than $0.67449\sigma$. Use is
made of this fact in calculating probable errors, which will be
explained later.

Expressing the above measurements in a different way we
may say that a value of X whose deviation from the mean,
either positively or negatively, is greater than $\sigma$ will occur
roughly 1 in 3 times; a positive or negative deviation greater
than $2\sigma$ will occur about 1 in 20 times, and greater than $3\sigma$
about 1 in 370 times. Tables have been calculated showing the
probability of obtaining deviations of any size. Such a table
is called a Table of Probability Integral and may be found in
Tables for Statisticians and Biometricians (Ref. 3).
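Where a table of the probability integral is not to hand, the figures quoted above can be recovered from the error function. The sketch below uses Python's standard `math.erf`, `math.erfc` and `statistics.NormalDist`; the function names `fraction_within` and `two_sided_tail` are illustrative and are not taken from the text.

```python
import math
from statistics import NormalDist


def fraction_within(k):
    """Fraction of a normal distribution lying within k standard deviations."""
    return math.erf(k / math.sqrt(2))


def two_sided_tail(k):
    """Probability of a deviation greater than k standard deviations, either way."""
    return math.erfc(k / math.sqrt(2))


for k in (1, 2, 3):
    print(k, round(100 * fraction_within(k), 1), round(1 / two_sided_tail(k), 1))

# Multiple of sigma that encloses exactly half the observations.
half_point = NormalDist().inv_cdf(0.75)
print(round(half_point, 5))
```

The "1 in 20" quoted in the text for a deviation of twice the standard deviation is a round figure; the exact multiple for odds of 19 to 1 is 1.96, the value given for large samples in tables of t.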

On such facts is based the conception of statistical
significance. The term 'significance' is used in statistics to indicate
that the odds are heavy against the deviation from its expected
value of a particular estimate, difference or coefficient
occurring by chance as a result of random sampling. In practice
odds of 19 to 1 against an occurrence by chance are usually
taken as indicating the significance of that occurrence. This
corresponds to the odds of getting a deviation from the mean
of a normal distribution greater than twice the standard
deviation, either positively or negatively. (Some
statisticians prefer heavier odds, such as 99 to 1, as their
criterion of significance. This must to some extent depend on
the variables, but ... is usually regarded as sufficient for ...)

The probability, which is symbolised as P, is usually expressed
as a decimal fraction, so that odds of 19 to 1 and 99 to 1 are
expressed as P = 0.05 and P = 0.01 respectively. Writers
sometimes refer to these as the 5% and 1% levels of significance.


4.iii. Testing the normality of a distribution. It is
unsafe to regard any bell-shaped distribution as being
necessarily a normal distribution, and since so much of statistical
method depends on normality it is important to be able to
test any given distribution for normality. This may be done
quite simply by mathematical means, although the process
requires a good deal of arithmetic.

Essentially, testing the normality of a distribution depends
on the calculation of two constants, $\beta_1$ and $\beta_2$, which are
derived from the first four moments about the mean of the
distribution, and two further quantities, $\gamma_1$ and $\gamma_2$, which are
related to $\beta_1$ and $\beta_2$ according to the following equations:

$$\gamma_1 = \sqrt{\beta_1},$$
$$\gamma_2 = \beta_2 - 3.$$

$\gamma_1$ is a measure of whether or not the distribution is
symmetrical. $\gamma_2$ measures departures of a symmetrical nature from
normality. The use of these constants will be explained later.
Now the first four moments about the mean of a frequency
distribution are denoted by $\mu_1$, $\mu_2$, $\mu_3$ and $\mu_4$, and these are
calculated, with certain theoretical corrections, by a method
which is simply an extension of that already used in calculating
the standard deviation, as in Example 9. The student should
refer again to that example. In addition to the fx and fx²
columns, two further columns, fx³ and fx⁴, are needed, the totals
of which will yield $\Sigma(fx^3)$ and $\Sigma(fx^4)$. If these four totals are
divided by N we get four quantities denoted by $\nu_1$, $\nu_2$, $\nu_3$ and $\nu_4$.
The moments about the mean are then obtained from the
equations:

$$\mu_1 = 0,$$
$$\mu_2 = \nu_2 - \nu_1^2,$$
$$\mu_3 = \nu_3 - 3\nu_1\nu_2 + 2\nu_1^3,$$
$$\mu_4 = \nu_4 - 4\nu_1\nu_3 + 6\nu_1^2\nu_2 - 3\nu_1^4.$$

When the variate is continuous certain corrections have to be
applied for grouping, and the equations then become:

$$\mu_1 = 0,$$
$$\mu_2 = \nu_2 - \nu_1^2 - \tfrac{1}{12},$$
$$\mu_3 = \nu_3 - 3\nu_1\nu_2 + 2\nu_1^3,$$
$$\mu_4 = \nu_4 - 4\nu_1\nu_3 + 6\nu_1^2\nu_2 - 3\nu_1^4 - \tfrac{1}{2}\mu_2 - \tfrac{1}{80}.$$

Having obtained these four moments, $\beta_1$ and $\beta_2$ are given by
the formulae

$$\beta_1 = \frac{\mu_3^2}{\mu_2^3}, \qquad \beta_2 = \frac{\mu_4}{\mu_2^2}.$$

From these $\gamma_1$ and $\gamma_2$ are readily calculated. The standard
errors* of $\gamma_1$ and $\gamma_2$ are $\sqrt{6/N}$ and $\sqrt{24/N}$ respectively. This
means that if the values of $\gamma_1$ and $\gamma_2$ are less than twice these
standard errors, then the distribution is not significantly
different from the normal form: if they are greater than twice
their standard errors, the distribution is not normal.

This mass of symbolism will probably be alarming to the 
elementary student, but the actual process of calculation in- 
volves arithmetic only and is shown in the following example. 


Example 11. Test the normality of the distribution given in
the first two columns of the table below:

      f      x     fx     fx²    fx³     fx⁴
      1     -5     -5     25    -125     625
      2     -4     -8     32    -128     512
      5     -3    -15     45    -135     405
     10     -2    -20     40     -80     160
     20     -1    -20     20     -20      20
     50      0    -68      0    -488       0
     22      1     22     22      22      22
     11      2     22     44      88     176
      5      3     15     45     135     405
      3      4     12     48     192     768
      1      5      5     25     125     625
    130            76    346     562    3718
                  -68           -488
                    8             74

$\nu_1 = 8/130 = 0.0615$,
$\nu_2 = 346/130 = 2.6615$,
$\nu_3 = 74/130 = 0.5692$,
$\nu_4 = 3718/130 = 28.6000$.

* The meaning of the term 'standard error' is explained in
Chapter V.

Hence

$\mu_2 = 2.6615 - 0.0615^2 - 0.0833 = 2.5744$,
$\mu_3 = 0.5692 - (3 \times 2.6615 \times 0.0615) + (2 \times 0.0615^3) = 0.0787$,
$\mu_4 = 28.6 - (4 \times 0.0615 \times 0.5692) + (6 \times 0.0615^2 \times 2.6615) - (3 \times 0.0615^4) - \tfrac{1}{2}(2.5744) - 0.0125 = 27.2207$.

$$\beta_1 = \frac{0.0787^2}{2.5744^3} = 0.000363,$$

$$\beta_2 = \frac{27.2207}{2.5744^2} = 4.1072;$$

$\gamma_1 = 0.0191$ ($\gamma_1$ has the same sign as $\mu_3$),

$$\text{S.e.} = \sqrt{\frac{6}{130}} = 0.215;$$

$\gamma_2 = 1.1072$,

$$\text{S.e.} = \sqrt{\frac{24}{130}} = 0.430.$$

It will be seen that $\gamma_1$ is considerably less than twice its
standard error, hence the distribution is symmetrical. $\gamma_2$,
however, is more than twice its standard error, so that the
distribution departs from normality in a symmetrical manner. $\gamma_2$ is
said to measure kurtosis. Curves which are flat-topped and
short-tailed compared with the normal curve are called
platykurtic: for these $\beta_2$ is less than 3. Curves which are sharply
peaked and long-tailed, and for which $\beta_2$ is greater than 3, are
called leptokurtic. In the above example, $\beta_2$ is greater than 3,
so that the distribution is leptokurtic and not normal. The
student should draw a histogram of this distribution and note
the peaked shape of it.
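The whole of Example 11 can be followed in a short Python sketch. It assumes the grouped (continuous-variate) form of the moment equations, with the corrections 1/12 and 1/80 and with half the corrected second moment subtracted from the fourth moment, as in the worked figures above. The function and variable names are illustrative. Because the program carries full precision rather than four decimal places, its values differ slightly from the hand working in the last figure or two.

```python
import math

# Frequency distribution of Example 11: working unit x -> frequency f.
freq = {-5: 1, -4: 2, -3: 5, -2: 10, -1: 20, 0: 50,
        1: 22, 2: 11, 3: 5, 4: 3, 5: 1}


def normality_constants(freq_table, grouped=True):
    """Return gamma1, gamma2 and their standard errors for a frequency table."""
    n = sum(freq_table.values())
    # nu_k = sum(f * x**k) / N for k = 1..4
    v1, v2, v3, v4 = (
        sum(f * x ** k for x, f in freq_table.items()) / n for k in (1, 2, 3, 4)
    )
    m2 = v2 - v1 ** 2
    m3 = v3 - 3 * v1 * v2 + 2 * v1 ** 3
    m4 = v4 - 4 * v1 * v3 + 6 * v1 ** 2 * v2 - 3 * v1 ** 4
    if grouped:
        m2 -= 1 / 12             # grouping correction to the second moment
        m4 -= m2 / 2 + 1 / 80    # uses the corrected second moment
    beta1 = m3 ** 2 / m2 ** 3
    beta2 = m4 / m2 ** 2
    gamma1 = math.copysign(math.sqrt(beta1), m3)  # same sign as mu_3
    gamma2 = beta2 - 3
    return gamma1, gamma2, math.sqrt(6 / n), math.sqrt(24 / n)


g1, g2, se1, se2 = normality_constants(freq)
print(round(g1, 4), round(se1, 3), round(g2, 4), round(se2, 3))
print("symmetrical:", abs(g1) < 2 * se1, "| normal kurtosis:", abs(g2) < 2 * se2)
```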


4.iv. The above is not the only method of testing the
normality of a distribution. An alternative method is to fit a
normal curve to a frequency distribution and test the goodness
of fit by the $\chi^2$ method. (See Section 9.vii.)


4.v. The binomial distribution and the calculation of
probabilities. In some cases the significance of observed
results may be tested by use of the binomial distribution.
This distribution is well known in algebra and consists of the
expansion of the expression $N(p + q)^n$. The full expansion of
this general expression, written symbolically, is:

$$N(p+q)^n = N\left[p^n + np^{n-1}q + \frac{n(n-1)}{1 \cdot 2}\,p^{n-2}q^2 + \frac{n(n-1)(n-2)}{1 \cdot 2 \cdot 3}\,p^{n-3}q^3 + \dots + q^n\right]. \qquad (10)$$

Usually in probability problems, p is the probability of an
event happening, expressed as a fraction, and q the
probability of its not happening, so that p + q = 1. N is the number
of trials made and n is the number of events in each trial.
For example, suppose we toss 4 pennies 32 times. Tossing
1 penny is an event; tossing 4 at once is a trial of 4 events and
in our experiment we make 32 trials. We wish to know the
probabilities of getting 4 heads, 3 heads and 1 tail, etc. Since
the chance of getting either a head or a tail in one event is ½,
the probabilities of getting the different results will be given
by the successive terms in the expansion of $32(\tfrac12 + \tfrac12)^4$.
Substituting these figures in the general expansion (10) we obtain

$$32(\tfrac12+\tfrac12)^4 = 32\left[(\tfrac12)^4 + 4(\tfrac12)^3(\tfrac12) + \frac{4 \cdot 3}{1 \cdot 2}(\tfrac12)^2(\tfrac12)^2 + \frac{4 \cdot 3 \cdot 2}{1 \cdot 2 \cdot 3}(\tfrac12)(\tfrac12)^3 + (\tfrac12)^4\right].$$

Evaluating the successive terms in this we have, calling
heads H and tails T,

chances of getting 4H and 0T are $(\tfrac12)^4 \times 32 = 2$
chances of getting 3H and 1T are $4(\tfrac12)^3(\tfrac12) \times 32 = 8$
chances of getting 2H and 2T are $\frac{4 \cdot 3}{1 \cdot 2}(\tfrac12)^2(\tfrac12)^2 \times 32 = 12$
chances of getting 1H and 3T are $\frac{4 \cdot 3 \cdot 2}{1 \cdot 2 \cdot 3}(\tfrac12)(\tfrac12)^3 \times 32 = 8$
chances of getting 0H and 4T are $(\tfrac12)^4 \times 32 = 2$
Total 32

If we wished to know the odds of getting at least 2H showing
in each trial, we could find them by summing all the terms
involving two or more heads, i.e. the first three terms. Hence out
of 32 trials we should expect to see two or more heads in
2 + 8 + 12 = 22 trials. The process is similar in dice-throwing.
Suppose, for instance, we throw 3 dice 60 times: how often
should we expect to get 3 sixes? The probability p of getting
a six in a single event is 1/6, so that the chances of getting 3
sixes in 60 trials of 3 events is given by the first term in the
expansion of $60(\tfrac16 + \tfrac56)^3$. This term is $60(\tfrac16)^3 = 60/216$. We should
not therefore expect to get 3 sixes in as few as 60 trials, in fact
we should need to make at least 216 throws before we could
expect this to occur once. This does not mean, of course, that
3 sixes would be thrown at the 216th throw (they might be
thrown at the first attempt) but it means that the chances of
throwing 3 sixes are 1 in 216, so that in a very large number of
trials only 1 in 216 would yield this result.
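The successive terms of expansion (10) are easily generated by program. The sketch below uses `math.comb` from the Python standard library for the binomial coefficients; the function name `expected_counts` is an illustrative choice.

```python
from math import comb


def expected_counts(trials, n, p):
    """Terms of N(p + q)^n, from n successes down to 0 successes."""
    q = 1 - p
    return [trials * comb(n, k) * p ** k * q ** (n - k)
            for k in range(n, -1, -1)]


# 4 pennies tossed 32 times: expected numbers of trials with 4, 3, 2, 1, 0 heads.
coins = expected_counts(32, 4, 0.5)
at_least_two_heads = sum(coins[:3])
print(coins, at_least_two_heads)

# 3 dice thrown 60 times: expected number of trials showing 3 sixes.
three_sixes = expected_counts(60, 3, 1 / 6)[0]
print(round(three_sixes, 3))
```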

A more complicated problem would arise if we asked how
often we could expect a total score of 15 or more if we threw
3 dice 60 times. A score of 16, for example, can be obtained in
six ways:
6, 6, 4: 6, 4, 6: 4, 6, 6: 6, 5, 5: 5, 6, 5 or 5, 5, 6. Enumerating
the possibilities in this way we obtain the following table:

             No. of                    No. of
    Score    possible ways    Score    possible ways
     18            1           10           27
     17            3            9           25
     16            6            8           21
     15           10            7           15
     14           15            6           10
     13           21            5            6
     12           25            4            3
     11           27            3            1
                                    Total  216

By addition we see that there are 216 ways of getting
different scores. As a check on this total, note that there are 6
ways in which the first die may fall, and for each of these there
are 6 ways in which the second die may fall, and for each
combination of the first and second there are 6 ways in which the
third may fall. The total number of possible combinations of
the three, therefore, is 6 × 6 × 6 = 216. Of these 216 possible
score combinations, 1 + 3 + 6 + 10 = 20 will give a score of 15
or more, hence the number of times we should expect a score
of 15 or more in 60 throws of 3 dice is

$$60 \times \frac{20}{216} = 5.56.$$

Although this illustration does not directly make use of the
binomial expansion, it gives an example of the calculation of
probabilities, a process which is not always obvious.
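The table of scores can be verified by listing every possible fall of the three dice. The following sketch does so with `itertools.product` and `collections.Counter` from the Python standard library; the variable names are illustrative.

```python
from collections import Counter
from itertools import product

# Count the number of ways of obtaining each total score with 3 dice.
ways = Counter(sum(throw) for throw in product(range(1, 7), repeat=3))

total = sum(ways.values())
favourable = sum(count for score, count in ways.items() if score >= 15)
expected = 60 * favourable / total

for score in range(18, 2, -1):
    print(score, ways[score])
print(total, favourable, round(expected, 2))
```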

As a further example, using the binomial expansion, take
the case of a football coupon where the results of 14 matches
have to be forecast. The result may be a win, a draw or a loss
in any one match. Assuming that the chances of a win, a draw
or a loss are equal (this is probably not true in practice but it
is assumed here for the sake of illustration), the probability of
getting any single forecast correct by chance is 1 in 3. What is
the probability of getting all 14 forecasts correct and all 14
wrong? The answer to this is given by the first and last terms
of the expansion of the binomial $(\tfrac13 + \tfrac23)^{14}$. The first term is
$(\tfrac13)^{14} = 1/4{,}782{,}969$. The last term is

$$(\tfrac23)^{14} = 16{,}384/4{,}782{,}969 = 1/291.93.$$

Hence to make sure of one correct forecast in all 14 cases one
would have to fill in over 4¾ million coupons, each with a
different forecast, whereas on the average 1 form in every 292
would be completely wrong.
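The two extreme terms of the football coupon expansion can be evaluated exactly with Python's `fractions.Fraction`, which avoids any rounding. The variable names are illustrative, and the equal chances of win, draw and loss are the assumption already made in the text.

```python
from fractions import Fraction

p_correct = Fraction(1, 3)   # assumed chance of one correct forecast
matches = 14

all_correct = p_correct ** matches
all_wrong = (1 - p_correct) ** matches

print(all_correct)                       # chance of 14 correct forecasts
print(all_wrong)                         # chance of 14 wrong forecasts
print(round(float(1 / all_wrong), 2))    # coupons per completely wrong coupon
```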


4.vi. Now the mean of a binomial distribution $(p + q)^n$ is
$np$ and the standard deviation is $\sqrt{npq}$. Suppose we toss a
penny 100 times and count a head as a success. If the penny is
unbiased, the mean number of successes would be 100 × ½ = 50.
If, however, in a particular trial we obtained 62 heads, could
we regard the penny as being biased? The standard deviation
is $\sqrt{100 \times \tfrac12 \times \tfrac12} = 5$. The observed value of 62 successes differs
from the expected mean by 12, and 12 is 2.4 times the standard
deviation. Reference to the table of the probability integral
shows that the probability of getting a deviation from the
mean of $2.4\sigma$ or greater is just less than 0.02, so that we should
strongly suspect the penny of being biased. Note that in this
example we should have obtained the same result if we had
had 38 observed successes, since the probability of a deviation
of $2.4\sigma$ or more on either side of the mean is just less than 0.02.
If, however, the penny were tossed in an electrical field which
we had reason to suspect would cause more heads to appear,
we should have to ask what is the probability of obtaining the
same deviation from the mean in a positive direction only.
This would be half that given in the table, i.e. with 62 heads it
would be just less than 0.01 (0.0082 to be exact). Hence we
should conclude that the electrical field was definitely having
the suspected effect.
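The penny-tossing test can be sketched in Python, using the normal curve in place of the printed table of the probability integral. The function name is illustrative; `math.erfc` supplies the probability of a deviation in either direction, and half of it gives the probability in one direction only.

```python
import math


def binomial_deviation_test(n, p, observed):
    """Compare an observed number of successes with the binomial mean np."""
    mean = n * p
    sd = math.sqrt(n * p * (1 - p))
    ratio = (observed - mean) / sd            # deviation in units of sigma
    p_either_side = math.erfc(abs(ratio) / math.sqrt(2))
    p_one_side = p_either_side / 2
    return mean, sd, ratio, p_either_side, p_one_side


mean, sd, ratio, p_both, p_one = binomial_deviation_test(100, 0.5, 62)
print(mean, sd, ratio, round(p_both, 4), round(p_one, 4))
```

The same function applied to 59 successes out of 100 gives a ratio of 1.8, the case discussed in the paragraphs that follow.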

The question of when to use P and when to use ½P for testing
significance is apt to be a difficult one. The answer depends
entirely upon the hypothesis we are testing. For example,
a test is given to 100 subjects twice and it is found that 59
subjects get a better score the second time and 41 get a worse.
With these data we may test two hypotheses.

(a) Assuming that it is equally likely for a subject to get a
better or a worse score the second time, the chance of a subject
getting a better score is ½. We may then test the hypothesis
that the observed distribution of better and worse scores does
not differ significantly from chance. As in the previous
example, we should expect 50 subjects to improve by chance with
a standard deviation of 5. The observed deviation from the
mean is 9, which is 1.8 times $\sigma$. From the tables it may be seen
that for this deviation P = 0.071. Hence we conclude that the
observed distribution does not differ significantly from chance.

(b) We may, however, decide on psychological grounds that
doing a test twice is likely to have a disturbing influence on
the scores, and we proceed to test the hypothesis that
twice-testing will cause an improvement in scores. Here we are
dealing with deviations from the mean in a positive direction
only, so that the probability is ½P = 0.035. From this we
should conclude that scoring on the test has improved
significantly on a second testing. Note that this shows only that
a greater number of individuals than would be expected by
chance improve on a second testing. The amount of
improvement, as measured by the number of marks scored, might still
not be significant. To decide this further point it would be
necessary to average the 100 differences between the first and
second scores and apply the t test to see if this mean differed
significantly from zero.

As a note of warning, it must be emphasised that the grounds
for using ½P as a measure of significance must be exceedingly
firm and justified before making the actual test. If there is the
slightest doubt, it is far safer in the interests of truth to use
P as a criterion. This is a matter of statistical ethics. The
honest inquirer makes up his mind before analysing his data
what can reasonably be expected from them, and for
conviction about the truth of his statistical results he must feel that
they are not based on any specious argument, however
plausible.

EXERCISES ON CHAPTER IV 


11. Using the method of Section 4.iii (Example 11), test the
normality of the distribution of the scores in test A for the whole
100 subjects. Make use of the grouped frequency distribution table
constructed in Exercise 3.

12. Repeat Exercise 11 for the whole 100 scores in test C.

13. Five pennies are tossed 320 times. How often would you expect
at least 4 heads?

14. Sixty-four people are asked a problem question, the answer to
which can be only 'Yes' or 'No'; 38 people answer 'Yes', which is
correct, and 26 answer 'No'. What are the chances of obtaining this
result if all the answers were guessing?


Chapter V


SIGNIFICANCE OF MEAN AND DIFFERENCE 
BETWEEN MEANS 


5.i. The observations recorded in a single biological experi- 
ment are but one sample drawn from the whole population of 
possible samples. If a second experiment is made it is unlikely 
that the mean of the observations in this case will be identical 
with that of the first experiment. In short, it will be found that 
a large number of experiments will yield many different values 
of the mean, each one departing more or less from the true 
mean of the whole population.

If the standard deviation of the whole population is $\sigma$, and
... standard deviation $\sigma/\sqrt{N}$. If the population is normally
distributed ...

$$\text{S.e. of mean} = \frac{\sigma}{\sqrt{N}}, \qquad (11)$$

where $\sigma$ is the standard deviation of the sample and N the
number of observations in it.
(There was formerly a practice, which has little to
recommend it, of calculating the probable error of the mean. This is
given by the formula

$$\text{P.e. of mean} = \frac{0.67449\,\sigma}{\sqrt{N}}. \qquad (11A)$$

It may be noted that three times the probable error is roughly
equal to twice the standard error.)


5.ii. Significance of a single mean. If we have only a
single sample to give estimates of $\bar{X}$ and $\sigma$, then the
distribution of $\bar{X}/\sigma$ will not be normal. However, the correct
distribution has been calculated and tables have been made enabling
us to make use of the data from a single sample; an extract
from these tables will be given later.

When calculating the standard deviation for the purpose of
examining the significance of the mean, $S(X - \bar{X})^2$ should be
divided by (N - 1) instead of by N. The reason for this is that
we are making an estimate of the standard deviation in the
whole population and this is best obtained by dividing the
sum of the squared deviations from the mean by the number
of degrees of freedom available. This number is one less than
the number in the sample.


Table III. Values of t corresponding to
a probability P = 0.05

     n      t         n      t         n      t
     1   12.706      11    2.201      21    2.080
     2    4.303      12    2.179      22    2.074
     3    3.182      13    2.160      23    2.069
     4    2.776      14    2.145      24    2.064
     5    2.571      15    2.131      25    2.060
     6    2.447      16    2.120      26    2.056
     7    2.365      17    2.110      27    2.052
     8    2.306      18    2.101      28    2.048
     9    2.262      19    2.093      29    2.045
    10    2.228      20    2.086      30    2.042

For n = ∞, t = 1.96.
The above table is an extract from a full table given by R. A. Fisher
(Ref. 2), Table IV; or in the Fisher and Yates Tables (Ref. 5), Table III.

Having obtained the mean and standard deviation we need
to calculate a statistic known as t; this is essentially the ratio
of the mean to its standard error. (t may also be the ratio of
a difference between means to its standard error: see below,
Section 5.iii (b).)
For a single mean

$$t = \bar{X} \div \frac{\sigma}{\sqrt{N}} = \frac{\bar{X}\sqrt{N}}{\sigma}. \qquad (12)$$

In Table III are given values of t corresponding to different
values of n, the number of degrees of freedom, i.e. n = N - 1 in
this case. The odds against values of t as big as or bigger than
these occurring by chance are 19:1, i.e. the probability, usually
denoted by P, of their occurring by chance is 0.05. If the
calculated value of t is greater than that given in the table for
the appropriate value of n, then the mean is significantly
different from zero.


Example 12. Ten schoolchildren were given an arithmetic 
test. They were then given a month’s further tuition and a 
second test of equal difficulty was held at the end of it. Their 
marks in these two tests are given below. 


Scholar Test 1 Test 2 
1 20 22 
2 18 19 
3 19 17 
4 22 18 
5 17 21 
6 20 23 
7 19 19 
8 16 20 
9 21 22 

10 19 20 


Do these marks give evidence that the scholars had benefited
by the extra tuition?
This problem resolves itself into the question, Is the mean
of the differences between successive marks significantly
different from zero?

We need first to construct a third column giving the values
of (Test 2 - Test 1); this will be our X column. Add this column
to get S(X) and obtain $\bar{X}$ by dividing S(X) by N (formula (1)).
Make a fourth column giving $(X - \bar{X})$ and a fifth giving
$(X - \bar{X})^2$. Add this fifth column to obtain $S(X - \bar{X})^2$. This
gives us all the necessary data for calculating t. The last
three columns and the remainder of the working are shown
below.

    Test 2 - Test 1
          X          X - X̄     (X - X̄)²
          2            1          1
          1            0          0
         -2           -3          9
         -4           -5         25
          4            3          9
          3            2          4
          0           -1          1
          4            3          9
          1            0          0
          1            0          0
         10                      58

$S(X) = 10$, $\bar{X} = 1$,

$$\sigma^2 = \frac{S(X - \bar{X})^2}{N - 1} = \frac{58}{9}.$$

Hence, by formula (12),

$$t = \frac{1 \times \sqrt{10}}{\sqrt{58/9}} = \sqrt{\frac{90}{58}} = 1.25.$$

Reference to Table III shows that for n = 9, t = 2.262. Our
calculated value is less than this, hence the mean of X is not
significantly different from zero, and the marks are insufficient
to prove the benefit of the extra tuition.
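The working of Example 12 may be sketched in Python as follows. The marks are those of the table in the example, the divisor N - 1 is used for the variance as explained in Section 5.ii, and the critical value 2.262 is the Table III entry for n = 9. The function name is illustrative.

```python
import math

test_1 = [20, 18, 19, 22, 17, 20, 19, 16, 21, 19]
test_2 = [22, 19, 17, 18, 21, 23, 19, 20, 22, 20]


def single_mean_t(values):
    """Formula (12): t for a single mean, with N - 1 degrees of freedom."""
    n = len(values)
    mean = sum(values) / n
    sum_sq_dev = sum((x - mean) ** 2 for x in values)
    sd = math.sqrt(sum_sq_dev / (n - 1))
    return mean, sum_sq_dev, mean * math.sqrt(n) / sd


differences = [second - first for first, second in zip(test_1, test_2)]
mean, sum_sq_dev, t = single_mean_t(differences)

critical_t = 2.262  # Table III, n = 9, P = 0.05
print(differences, mean, sum_sq_dev, round(t, 2), t > critical_t)
```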

The above method of calculation is not convenient if the
mean is not a whole number. We may, therefore, adopt an
alternative method of calculation, making use of the identity

$$S(X - \bar{X})^2 = S(X^2) - \frac{1}{N}[S(X)]^2.$$

In the above example all that is needed is to construct and
add a column giving the squares of all the observations, i.e. a
column of X² yielding a total of S(X²). In Example 12,

$$S(X^2) = 4 + 1 + 4 + 16 + 16 + 9 + 0 + 16 + 1 + 1 = 68.$$

Hence

$$S(X - \bar{X})^2 = 68 - \tfrac{1}{10}(10)^2 = 68 - 10 = 58.$$

From this point the calculation of t is identical with that
given in Example 12.
5.iii. Significance of the difference between means.
An important and often occurring problem is to determine
whether there is a real difference between two observational
means or not. In statistical language this problem may be
expressed in the words, Is the difference between the means such
that they might have been drawn from the same population
by random sampling or are they drawn from two different
populations? There are two methods of dealing with this
question appropriate to the cases in which the samples are
large or in which they are small.

(a) Large samples. If the numbers of observations in the two
samples are large, say, at least 50, the question may be settled
by calculating the standard error of the difference between the
means. If the means are $\bar{X}_1$ and $\bar{X}_2$, their standard deviations
$\sigma_1$ and $\sigma_2$, and the numbers in the samples $N_1$ and $N_2$
respectively, then the standard error of the difference between the
means is given by the formula

$$\text{S.e. of difference} = \sqrt{\frac{\sigma_1^2}{N_1} + \frac{\sigma_2^2}{N_2}}. \qquad (13)$$

This formula applies only if the two variables are independent
or uncorrelated (see Chapter VI).

If the two variables are correlated

$$(\text{S.e. of difference})^2 = \frac{\sigma_1^2}{N_1} + \frac{\sigma_2^2}{N_2} - 2r\,\frac{\sigma_1}{\sqrt{N_1}} \cdot \frac{\sigma_2}{\sqrt{N_2}},$$

where r is the coefficient of correlation.
As before, the probable error of the difference would be

$$\text{P.e. of difference} = 0.67449\sqrt{\frac{\sigma_1^2}{N_1} + \frac{\sigma_2^2}{N_2}}. \qquad (13A)$$

If the difference between the two means is greater than twice
its standard error (or three times its probable error), then the
means are significantly different, i.e. it is unlikely that they
would be drawn from the same population by random
sampling, the odds against being at least 19 to 1.


Example 13. A group of boys and a group of girls were given
an intelligence test. The mean scores, standard deviations and
numbers in the groups were as follows:

            Boys   Girls
    Mean    124    121
    σ        12     10
    N        72     50

Was the mean test score of the boys significantly greater than
that of the girls?

Difference between the means = 124 - 121 = 3,

$$\text{s.e. of difference} = \sqrt{\frac{144}{72}+\frac{100}{50}} = \sqrt{4} = 2,$$

$$\frac{\text{Difference}}{\text{s.e. of difference}} = \frac{3}{2} = 1.5.$$

In this experiment, therefore, the mean intelligence test score
of the boys was not significantly greater than that of the girls.
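The large-sample test can be sketched in a few lines of Python. The function name `se_difference` is an assumption made for illustration; the figures are those of Example 13, and the criterion of twice the standard error is the one stated in the text.

```python
import math

def se_difference(sd1, n1, sd2, n2):
    """Standard error of the difference between two independent means, formula (13)."""
    return math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)

# Example 13: boys (mean 124, s.d. 12, N 72), girls (mean 121, s.d. 10, N 50).
se = se_difference(12, 72, 10, 50)
ratio = (124 - 121) / se

# Criterion from the text: the difference must exceed twice its standard error.
significant = ratio > 2
```

Here the ratio is 1.5, so the difference falls short of the criterion.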

(b) Small samples. When we wish to compare the means of
small samples of less than 50 observations, the use of formula
(13) is no longer a sufficiently strict test and we have to apply
the t test of significance. The test is essentially similar to that
in the previous section but corrections have to be made to
allow for sampling errors, which are more important in small
samples.

Suppose we have $N_1$ readings of a variable $x_1$ and $N_2$ readings
of a variable $x_2$ and we wish to see whether or not the means
$\bar{x}_1$ and $\bar{x}_2$ differ significantly from one another (or, in other
words, we wish to see what is the probability that the two
samples could be drawn from the same population). To apply
the t test in this case we need to know six quantities, $N_1$, $N_2$,
$S(x_1)$, $S(x_2)$, $S(x_1^2)$ and $S(x_2^2)$.

As usual,

$$\bar{x}_1 = \frac{S(x_1)}{N_1} \quad\text{and}\quad \bar{x}_2 = \frac{S(x_2)}{N_2}.$$

The variance of the combined observations, which we shall
denote by $\sigma_c^2$, is

$$\sigma_c^2 = \frac{S(x_1-\bar{x}_1)^2 + S(x_2-\bar{x}_2)^2}{N_1+N_2-2},$$

and the standard error of the difference is

$$\text{s.e. of difference} = \sigma_c\sqrt{\frac{1}{N_1}+\frac{1}{N_2}}.$$


(The reader should compare this expression with formula (13)
in the case where $\sigma_1 = \sigma_2$.)

In this case t is the ratio of the difference between the means
to the standard error of the difference, i.e.

$$t = \frac{\bar{x}_1-\bar{x}_2}{\sigma_c\sqrt{\dfrac{1}{N_1}+\dfrac{1}{N_2}}}. \tag{14}$$

We must now express $\sigma_c$ in terms of the six quantities with
which we started. We have

$$\sigma_c^2 = \frac{S(x_1^2)-\dfrac{[S(x_1)]^2}{N_1}+S(x_2^2)-\dfrac{[S(x_2)]^2}{N_2}}{N_1+N_2-2}.$$

Written in full, therefore,

$$\text{s.e. of difference} = \sqrt{\left(\frac{1}{N_1}+\frac{1}{N_2}\right)\frac{1}{N_1+N_2-2}} \times \sqrt{S(x_1^2)-\frac{[S(x_1)]^2}{N_1}+S(x_2^2)-\frac{[S(x_2)]^2}{N_2}}.$$

The part dealing with $N_1$ and $N_2$ under the first square-root
sign reduces to

$$\frac{N_1+N_2}{N_1 N_2 (N_1+N_2-2)},$$

for which we may write $1/N'$. For the convenience of students,
values of $\sqrt{N'}$ for $N_1$ and $N_2$ between 10 and 50 have been
tabulated in Appendix A.

Using the six quantities with which we started, we
may therefore express t in the following form:

$$t = \frac{(\bar{x}_1-\bar{x}_2)\sqrt{N'}}{\sqrt{S(x_1^2)-\dfrac{[S(x_1)]^2}{N_1}+S(x_2^2)-\dfrac{[S(x_2)]^2}{N_2}}}. \tag{14A}$$

The use of this formula is shown in the example below.
In this case, for using Table III to find the probability of
obtaining observed values of t, the value of n is

$$n = N_1+N_2-2.$$

Note. The t test is applicable only when the variates are
normally distributed and are not correlated.

Example 14. The span of apprehension of two small groups
of children, one from the lowest class in a school and the other
from the top class, was tested by seeing how many digits they
could repeat backwards from memory after hearing them once
repeated forwards. The numbers of digits correctly repeated in
the two cases were as follows:

Group А 3 5 6 4 3 3 4 
Groupi B ИБО е оте 


Is there any real difference in the span of apprehension of the
two groups?

We have the following data:

    For group A, N_1 = 7,          For group B, N_2 = 8,
                 S(X_1) = 28,                   S(X_2) = 62,
                 S(X_1^2) = 120,                S(X_2^2) = 516,
                 mean X_1 = 4.00.               mean X_2 = 7.75.

Hence $\bar{X}_2-\bar{X}_1 = 7.75 - 4.00 = 3.75$.

Now $\sqrt{N'} = 6.97$.

Also

$$S(X_1^2)-\frac{[S(X_1)]^2}{N_1}+S(X_2^2)-\frac{[S(X_2)]^2}{N_2} = 120-\frac{784}{7}+516-\frac{3844}{8} = 120-112+516-480.5 = 43.5.$$

Hence, by formula (14A),

$$t = \frac{3.75 \times 6.97}{\sqrt{43.5}} = \frac{26.1375}{6.595} = 3.96.$$

For using Table III in this case, the value of n for entry is
7 + 8 - 2 = 13. The value of t in Table III corresponding to
n = 13 is 2.160. Our calculated value of t is considerably
greater than this, hence the difference between the means is
significant and we conclude that the span of apprehension of
group B was significantly greater than that of group A.
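Formula (14A) needs only the six quantities named in the text, which makes it easy to sketch in Python. The function name `t_from_sums` is hypothetical; the inputs are the totals of Example 14, and $N'$ is computed directly rather than read from Appendix A.

```python
import math

def t_from_sums(n1, s1, ss1, n2, s2, ss2):
    """t by formula (14A) from N, S(x) and S(x^2) of each group."""
    mean_diff = s1 / n1 - s2 / n2
    # N' = N1 * N2 * (N1 + N2 - 2) / (N1 + N2); Appendix A tabulates its square root.
    n_prime = n1 * n2 * (n1 + n2 - 2) / (n1 + n2)
    # Sum of squares about the group means, pooled over both groups.
    pooled_ss = ss1 - s1 ** 2 / n1 + ss2 - s2 ** 2 / n2
    return mean_diff * math.sqrt(n_prime) / math.sqrt(pooled_ss)

# Example 14: group B entered first so that the difference is positive.
t = t_from_sums(8, 62, 516, 7, 28, 120)   # about 3.96
degrees_of_freedom = 8 + 7 - 2
```

The result is then compared with the tabulated value of t for n = 13, as in the example.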
If the actual observations in the two groups for comparison
are large, the arithmetic in calculating the standard error of
the difference between the means may be fairly arduous. In
such cases the work may be lightened by working from arbitrary
origins, provided the range of the readings is not too
great. To do this, we choose a convenient arbitrary origin, or
starting point, near the middle of the range of observations.
Calling this origin A, we construct a column headed (X - A)
by subtracting A from each reading in turn. Care must be
taken of the signs since some of the entries will be negative.
Summing this column gives us S(X - A). Each entry in this
column is then squared and the resulting (X - A)² column
summed to obtain S(X - A)².

We then make use of the following identities:

$$S(X) = S(X-A) + NA,$$

$$S(X^2) = S(X-A)^2 - NA^2 + 2A\,S(X),$$

$$S(X-\bar{X})^2 = S(X-A)^2 - \frac{[S(X-A)]^2}{N}.$$


The method of extending this process to two groups is 
illustrated in the following example. 


Example 15. Is the difference between the mean reaction
times of the following two groups significant?

    Group A   98  97 104 106 100 111  99  99 101 102
    Group B  100  94  93  99 101  87  86  91  85  86  89

The range of observations in group A is from 97 to 111, so we
may choose 105 as a convenient origin, i.e. $A_1 = 105$. Call the
readings in group A, $X_1$. We then construct a column headed
$X_1 - A_1$ and sum it (see the table below). Each reading in this
column is then squared, giving a column headed $(X_1-A_1)^2$
which we also sum. The process is repeated with group B
taking $A_2 = 90$.

Substituting the totals given in the table in the identities,
we have for group A,

$$S(X_1) = S(X_1-A_1) + N_1 A_1 = -33 + (10 \times 105) = 1017,$$

whence $\bar{X}_1 = 1017/10 = 101.7$,

$$S(X_1-\bar{X}_1)^2 = S(X_1-A_1)^2 - \frac{[S(X_1-A_1)]^2}{N_1} = 273 - \frac{33^2}{10} = 273 - 108.9 = 164.1.$$

Similarly for group B,

$$\bar{X}_2 = 91.9091,$$

$$S(X_2-\bar{X}_2)^2 = 354.9091.$$

    X_1-A_1   (X_1-A_1)²   X_2-A_2   (X_2-A_2)²
      -7         49          10        100
      -8         64           4         16
      -1          1           3          9
       1          1           9         81
      -5         25          11        121
       6         36          -3          9
      -6         36          -4         16
      -6         36           1          1
      -4         16          -5         25
      -3          9          -4         16
                             -1          1
    ---------------------------------------------
     -33        273          21        395


From Appendix A, for $N_1 = 10$ and $N_2 = 11$, we find that
$\sqrt{N'} = 9.98$. Substituting in formula (14A) we obtain

$$t = \frac{(101.7-91.9091) \times 9.98}{\sqrt{164.1+354.9091}} = \frac{97.7132}{22.78} = 4.29.$$

From Table III with $n = N_1+N_2-2 = 19$, we find a critical
value for t of 2.093. Our calculated value is much larger than
this, hence the difference between the means of the two groups
may be taken as significant.
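The working from arbitrary origins can be sketched in Python as a check on Example 15. The function name `sums_from_origin` is an assumption made for illustration; the readings and origins are those of the example, and $N'$ is computed directly instead of being taken from Appendix A.

```python
import math

def sums_from_origin(values, origin):
    """Return S(X) and S(X - mean)^2 from deviations about an arbitrary origin."""
    n = len(values)
    devs = [v - origin for v in values]
    s_dev = sum(devs)                    # S(X - A)
    s_dev2 = sum(d * d for d in devs)    # S(X - A)^2
    total = s_dev + n * origin           # S(X) = S(X - A) + NA
    ss_about_mean = s_dev2 - s_dev ** 2 / n
    return total, ss_about_mean

group_a = [98, 97, 104, 106, 100, 111, 99, 99, 101, 102]
group_b = [100, 94, 93, 99, 101, 87, 86, 91, 85, 86, 89]

total_a, ss_a = sums_from_origin(group_a, 105)
total_b, ss_b = sums_from_origin(group_b, 90)

n1, n2 = len(group_a), len(group_b)
n_prime = n1 * n2 * (n1 + n2 - 2) / (n1 + n2)
t = (total_a / n1 - total_b / n2) * math.sqrt(n_prime) / math.sqrt(ss_a + ss_b)
```

The choice of origin affects only the size of the numbers handled; any other origins would give the same totals and the same t.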

5.iv. Variance differences: the variance ratio. A point
that is frequently lost sight of is that when two means are
judged to be significantly different by the t test, this result
may be due not to the fact that the two groups of observations
are drawn from populations with different means, but to the
fact that the variances of the two groups are significantly
different. This may, and should, be examined by calculating
the variance ratio of the two groups and testing its significance
by reference to Fisher and Yates's Table V. The variance
ratio (v.r.) is simply obtained by working out the variance of
each group and dividing the larger by the smaller. The
number of degrees of freedom available for this is N - 1 for
each group, but note that in referring to Fisher and Yates's
table, $n_1$ is the number of degrees of freedom in the larger
variance.

Applying this to Example 15 of the previous section,
we have for the variance of group A, 164.1/9 = 18.23 and for
the variance of group B, 354.9091/10 = 35.49.

Hence the v.r. = 35.49/18.23 = 1.95.

Referring to Fisher and Yates's table, we find that for
$n_1 = 10$ (since the variance of group B is the larger) and
$n_2 = 9$, the critical value of the v.r. is about 3.1 (its exact
value could be found by interpolation). The calculated v.r.
is much smaller than this, so we conclude that the two
variances are not significantly different.
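A minimal Python sketch of the variance ratio follows. The function name `variance_ratio` is hypothetical; the sums of squares are those found for Example 15, and the critical value of about 3.1 is the one quoted in the text, not computed here.

```python
def variance_ratio(ss1, n1, ss2, n2):
    """Return (v.r., d.f. of larger variance, d.f. of smaller variance).

    ss1 and ss2 are sums of squares about the group means.
    """
    var1 = ss1 / (n1 - 1)
    var2 = ss2 / (n2 - 1)
    if var1 >= var2:
        return var1 / var2, n1 - 1, n2 - 1
    return var2 / var1, n2 - 1, n1 - 1

# Example 15: group A (N = 10) and group B (N = 11).
vr, df_larger, df_smaller = variance_ratio(164.1, 10, 354.9091, 11)

# Critical value quoted in the text for n1 = 10, n2 = 9.
variances_differ = vr > 3.1
```

The larger variance is always placed in the numerator, so the ratio is never less than 1.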


5.v. Significance of difference between proportions.
It frequently happens that no exact measurement of a quality
is possible but the presence or absence of the quality may be
observed. For example, we may note whether or not a person
has blue eyes without attempting to measure intensity of
blueness. In the same way we may note what proportion of
a group of objects have a certain quality and we may wish on
occasion to examine whether the proportion of one group
possessing a quality is really different from the proportion of
another group possessing that same quality. As in the case of
the difference between means, the problem may be expressed
in the form: What are the odds against obtaining by chance a
difference of proportion as big as the one observed in a
homogeneous population? Now in the case of investigating the
significance of the difference between means, the probability
of obtaining differences of various sizes depends on the
distribution of the variables, assumed in this instance to be a
normal distribution. In the case of proportions the
probabilities depend upon the binomial distribution (see Chapter IV).
The parameters of this type of distribution are different
from those of a normal distribution but the method of
examining the significance of differences of proportions is
similar.

Let N be the number of individuals in a group and p be
the proportion of them possessing some particular quality.
Usually p is expressed as a decimal fraction, and we call the
proportion not possessing the quality q, so then q = 1 - p. The
standard error of a single proportion is $\sqrt{pq/N}$. If we have two
groups of $N_1$ and $N_2$ individuals respectively and if the
proportions of them possessing the same quality are $p_1$ and $p_2$,
we need an expression giving us the standard error of the
difference between $p_1$ and $p_2$.

First we have to estimate the proportion of individuals in
the combined groups who possess the particular quality. If
p is this proportion, then

$$p = \frac{p_1 N_1 + p_2 N_2}{N_1+N_2}.$$

As before, q = 1 - p.

We have, then,

$$\text{Standard error of difference between proportions} = \sqrt{pq\left(\frac{1}{N_1}+\frac{1}{N_2}\right)}. \tag{15}$$


The differences between proportions drawn from a
homogeneous population are not distributed in the same way as
the differences between means drawn from a normal population.
It is therefore safer to require an observed difference
between proportions to be three times its standard error before
assuming significance.

Note. In examining the significance of the difference between
proportions, the actually observed proportions must
be used. If they are expressed as percentages, the
groups concerned are assumed to be of equal size, and this will
lead to error except when the groups are actually equal.

Example 16. In a group of 50 boys, 24 are over 4 ft. in height,
whilst 18 out of a group of 60 girls are over 4 ft. Can the
difference between the proportions of taller children in the
two groups be regarded as real?

We have

    N_1 = 50,  p_1 = 24/50 = 0.48,
    N_2 = 60,  p_2 = 18/60 = 0.30.

The combined proportion,

$$p = \frac{(0.48 \times 50)+(0.30 \times 60)}{50+60} = \frac{24+18}{110} = 0.38,$$

$$q = 1 - 0.38 = 0.62.$$

Hence s.e. of difference of proportions

$$= \sqrt{pq\left(\frac{1}{N_1}+\frac{1}{N_2}\right)} = \sqrt{0.38 \times 0.62 \times \left(\frac{1}{50}+\frac{1}{60}\right)} = \sqrt{0.2356 \times 0.03667} = \sqrt{0.008640} = 0.093.$$

The difference

$$p_1 - p_2 = 0.48 - 0.30 = 0.18.$$

The difference is not quite twice its standard error and we
therefore conclude that the difference between the proportions
of taller children in the two groups cannot be regarded as
a real one.
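Formula (15) can be sketched in Python from the raw counts. The function name `se_diff_proportions` is an assumption made for illustration; the counts are those of Example 16. Working from counts rather than percentages keeps the unequal group sizes in the calculation, as the note above requires.

```python
import math

def se_diff_proportions(k1, n1, k2, n2):
    """Standard error of the difference between two proportions, formula (15).

    k1 and k2 are the numbers possessing the quality in groups of n1 and n2.
    """
    p = (k1 + k2) / (n1 + n2)   # combined proportion
    q = 1 - p
    return math.sqrt(p * q * (1 / n1 + 1 / n2))

# Example 16: 24 of 50 boys and 18 of 60 girls are over 4 ft.
se = se_diff_proportions(24, 50, 18, 60)
diff = 24 / 50 - 18 / 60
ratio = diff / se
```

The ratio comes out a little under 2, in agreement with the conclusion of the example.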


EXERCISES ON CHAPTER V 


15. Subtract the score in test D from the score in test О for each 
of the first 25 subjects. Calculate the mean difference between these 


scores and determine whether this mean is significantly different from 
zero or not. Use formula (12). 


16. Using formula (13) and the standard deviations given in the
answers to Exercise 10, calculate the standard error of the difference
between the mean scores of the whole 100 subjects in the following
tests:

(a) A and G, (b) C and D, (c) C and F,
(d) D and F, (e) D and G, (f) F and G.

Thence determine which pairs of means are significantly different.
Use the means given in the answer to Exercise 3 (a).

17. Using the method of Section 5.iii (b), formula (14 A), examine
the significance of the difference between the following pairs of means:
(a) Test A: mean of subjects 1-25 and 26-50.
(b) Test С: mean of subjects 1-25 and 26-50. 
(c) Test D: mean of subjects 1-25 and 26-50. 
(d) Test С: mean of subjects 1-25 and 26-50. 
(e) Mean of subjects 1-25 in test C and mean of subjects 1-25 in
test G.
(f) Mean of subjects 51-75 in test C and mean of subjects 51-75
in test G.
(g) Mean of subjects 51-75 in test A and mean of subjects 51-75
in test D.
(h) Mean of subjects 76-100 in test F and mean of subjects 76-100
in test G.

18. Calculate the variance ratio between the two groups in Example
14. Does this invalidate the conclusion that the means of the two groups
are significantly different?

19. In a school of 320 boys in a certain town, 45 % were absent
through influenza in one month. In the same month in another town,
39 boys from a school of 150 were absent for the same reason. From
these samples, would you regard the prevalence of influenza as being
equal in the two towns?




Chapter VI
CORRELATION 


6.i. It frequently happens in experimental work that we 
wish to know the association between two variables, that is, 
to know to what extent one variable is related to the other. 
There are various methods of measuring such association, 
dependent on the nature of the variables, their types of dis- 
tribution, etc. When the two variables are numerical and 
normally distributed, the association, or correlation, between 
them may best be measured by a method known as the 
product-moment method. This is by far the most useful and 
theoretically satisfactory method of measuring correlation 
and much advanced statistical work is based upon product- 
moment correlation. It will therefore be considered first. 


6.ii. In describing methods of correlation the two variables
will be called X and Y. These variables will have means $\bar{X}$
and $\bar{Y}$ and standard deviations $\sigma_x$ and $\sigma_y$. Since the various
values of the variables will always be considered in pairs, an
X with a Y, there will be the same number of X's and Y's
in any particular case, i.e. N. In terms of these statistics a
quantity known as the coefficient of correlation may be
calculated: this coefficient is denoted by r. If there is complete
positive correlation between X and Y, r has the value 1; if
there is complete negative correlation it has the value -1, and
incomplete correlation gives decimal values for r between 1 and
-1. If there is no relation at all between the variables, r is 0.

The meaning of the above statements may be illustrated in
this way. Suppose the heights and weights of N people were
measured: these would be our X and Y and there would be one
value of X and one value of Y relating to each person. Now
suppose the tallest person was also the heaviest, the second
tallest the second heaviest and so on, until we reached the
shortest person who was also the lightest: in this case there
would be complete positive association between X and Y and
r would be 1 (provided the relation between X and Y was
exactly linear and could be expressed by the equation
Y = a + bX). If, on the other hand, the group of persons was
so peculiar that the tallest person was also the lightest, the
second tallest the second lightest and so on to the shortest,
who would be the heaviest in this case, then there would be
complete negative correlation between height and weight, and
r would be -1 (with the same proviso as before). Complete
correlation is very rare. Usually there is a general but not
complete agreement between two variables, so that r is
fractional.


6.iii. The general formula for r is quite simple. If $(X-\bar{X})$
and $(Y-\bar{Y})$ are the deviations of corresponding values of X and
Y from their means (i.e. the pair of values corresponding to one
person), then these two deviations may be multiplied together
to give the product $(X-\bar{X})(Y-\bar{Y})$. If we add together all
such products for the N persons, we obtain $S(X-\bar{X})(Y-\bar{Y})$.
The coefficient of correlation is then given by the formula

$$r = \frac{S(X-\bar{X})(Y-\bar{Y})}{N\sigma_x\sigma_y}. \tag{16}$$

The application of the name 'product-moment' to this method
may now be appreciated. The average deviation of any value
of X from the mean may be described as the first moment about
the mean, and similarly for the Y's. The mean product of such
deviations is similarly called a 'product-moment'. The
expression $S(X-\bar{X})(Y-\bar{Y})/N$ is called the co-variance.

Crude test scores may be transformed into standard scores
by expressing them as deviations from their mean in terms of
their standard deviation. Thus a crude score of X yields a
standard score of $(X-\bar{X})/\sigma_x$ and a crude score of Y yields a
standard score of $(Y-\bar{Y})/\sigma_y$. Call these $X_s$ and $Y_s$
respectively. Then

$$r = \frac{S(X_s Y_s)}{N}.$$

Formula (16) is the simple theoretical form. In practice
considerations of ease in calculation necessitate modifications of
this and we shall now consider some of them.


6.iv. Product-moment correlation when N is small.
If we have a small number of X's and Y's, say less than 80, the
calculation of r would be performed as follows. Write down the
values of X and Y in two parallel columns in their pairs, i.e. so
that the pair of readings in each horizontal row belongs to the
same person. Next calculate two more columns, headed X²
and Y², by squaring the terms in the first two columns. The
totals of these four columns will give us S(X), S(Y), S(X²) and
S(Y²), and these data will enable us to calculate the means
and standard deviations of X and Y by formulae (1) and
(5B).

Now instead of subtracting the mean $\bar{X}$ from each value of
X and $\bar{Y}$ from each value of Y and multiplying the answers
together, we shall get the total of products, or product-sum,
by a method similar to that used in Section 3.vii for calculating
the standard deviation. We shall multiply together the actual
values of X and Y as they stand in columns 1 and 2, sum the
products, and then subtract the product of $\bar{X}$ and $\bar{Y}$ at the
end, after having divided the product-sum by N. The validity
of this may readily be seen by the following simple algebraical
proof:

$$(X-\bar{X})(Y-\bar{Y}) = XY - \bar{X}Y - \bar{Y}X + \bar{X}\bar{Y}.$$

$$S(X-\bar{X})(Y-\bar{Y}) = S(XY) - \bar{X}S(Y) - \bar{Y}S(X) + N\bar{X}\bar{Y}.$$

$$\frac{S(X-\bar{X})(Y-\bar{Y})}{N} = \frac{S(XY)}{N} - \frac{\bar{X}S(Y)}{N} - \frac{\bar{Y}S(X)}{N} + \bar{X}\bar{Y}
= \frac{S(XY)}{N} - \bar{X}\bar{Y} - \bar{Y}\bar{X} + \bar{X}\bar{Y}
= \frac{S(XY)}{N} - \bar{X}\bar{Y}.$$

Accordingly we form a fifth column, headed XY, by
multiplying together corresponding X's and Y's in the first two
columns. The total of this column is S(XY). We may then
obtain r from the formula

$$r = \frac{\dfrac{S(XY)}{N} - \bar{X}\bar{Y}}{\sigma_x\sigma_y}. \tag{16A}$$


Example 17. Twenty pupils are given small tests in arithmetic
and Latin, and the marks gained in each test, from a
maximum of 10, are shown below.

    Pupil    A   B   C   D   E   F   G   H   I   J
    Arith.   3   9   7   8   4   1   6   9   7   8
    Latin    1   8   4  10   6   5   5   3   8   7

    Pupil    K   L   M   N   O   P   Q   R   S   T
    Arith.   5   4   6   5   2   6   5   4   6   2
    Latin    2   6   5   9   4   5   7   1   3   5

Calculate the coefficient of correlation between the two sets of
marks.

We construct the five columns, as described above, and
obtain S(X), S(Y), S(X²), S(Y²) and S(XY). The actual
arithmetic is shown.

      X     Y     X²    Y²    XY
      3     1      9     1     3
      9     8     81    64    72
      7     4     49    16    28
      8    10     64   100    80
      4     6     16    36    24
      1     5      1    25     5
      6     5     36    25    30
      9     3     81     9    27
      7     8     49    64    56
      8     7     64    49    56
      5     2     25     4    10
      4     6     16    36    24
      6     5     36    25    30
      5     9     25    81    45
      2     4      4    16     8
      6     5     36    25    30
      5     7     25    49    35
      4     1     16     1     4
      6     3     36     9    18
      2     5      4    25    10
    --------------------------------
    107   104    673   660   595

$$\bar{X} = \frac{107}{20} = 5.35, \qquad \bar{Y} = \frac{104}{20} = 5.20.$$

By formula (5B),

$$\sigma_x = \sqrt{\frac{673}{20} - (5.35)^2} = 2.242, \qquad \sigma_y = \sqrt{\frac{660}{20} - (5.20)^2} = 2.441.$$

$$\frac{S(XY)}{N} = \frac{595}{20} = 29.75.$$

Hence, by formula (16A),

$$r = \frac{29.75 - 5.35 \times 5.20}{2.242 \times 2.441} = 0.353.$$
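The five-column method of formula (16A) can be sketched in Python using the marks of Example 17. The function name `product_moment_r` is an assumption made for illustration, and the standard deviations use divisor N, as the worked figures of the example imply.

```python
import math

def product_moment_r(xs, ys):
    """Coefficient of correlation by formula (16A), from raw column totals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Standard deviations from S(X^2)/N - mean^2 (divisor N).
    sd_x = math.sqrt(sum(x * x for x in xs) / n - mean_x ** 2)
    sd_y = math.sqrt(sum(y * y for y in ys) / n - mean_y ** 2)
    # Co-variance from the product-sum: S(XY)/N - mean_x * mean_y.
    covariance = sum(x * y for x, y in zip(xs, ys)) / n - mean_x * mean_y
    return covariance / (sd_x * sd_y)

arith = [3, 9, 7, 8, 4, 1, 6, 9, 7, 8, 5, 4, 6, 5, 2, 6, 5, 4, 6, 2]
latin = [1, 8, 4, 10, 6, 5, 5, 3, 8, 7, 2, 6, 5, 9, 4, 5, 7, 1, 3, 5]

r = product_moment_r(arith, latin)   # about 0.353
```

The column totals formed along the way are S(X) = 107, S(Y) = 104 and S(XY) = 595.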
6.v. If the actual values of X and Y are large, a good deal
of arithmetic would be needed to obtain the third, fourth and
fifth columns in the above method. Provided the ranges of the
X's and Y's are fairly small, a modification of the method may
be made by writing down the values of X and Y as they
deviate from suitable arbitrary origins. Such arbitrary origins
would be chosen near the middle of the range of each variable
and the deviations of each X and Y from these would have to
be written with due regard to sign. For example, if the values
of X varied from 61 to 78 we might take 70 as arbitrary origin.
In this case a reading of 61 would be written as -9, i.e. 61 - 70.
In the same way, 68 would become -2, 78 would be 8 and
70 would be 0.

In this manner we could replace our original X and Y
columns by two other columns recording the deviations of the
X's and Y's from their arbitrary origins. Call these X' and
Y'. From these the columns of squares and products may be
obtained as before. In employing this method it is
advantageous to set out the plus and minus values of X' and Y' in
separate columns, as shown in Example 18.

In this case S(X')/N would give the difference between the
arbitrary origin and the true mean of X, so that using the
notation of Section 2.iii we may write $S(X')/N = D_x$, and
$S(Y')/N = D_y$. Similarly, from Section 3.viii, with
modifications, we get

$$\sigma_x = \sqrt{\frac{S(X'^2)}{N} - D_x^2} \quad\text{and}\quad \sigma_y = \sqrt{\frac{S(Y'^2)}{N} - D_y^2}.$$

Formula (16A) is accordingly modified to the following in this
case:

$$r = \frac{\dfrac{S(X'Y')}{N} - D_x D_y}{\sigma_x\sigma_y}. \tag{16B}$$

The calculation of r by this method is exemplified below.
The calculation of r by this method is exemplified below. 
Example 18. Find the correlation between the two test
scores given below for 10 subjects.

    Subject    1    2    3    4    5    6    7    8    9   10
    Test A    19   25   17   20   26   30   29   21   23   24
    Test B   145  151  140  144  138  140  142  150  149  150

Test A has a range from 17 to 30, so that a convenient
arbitrary origin will be 25. For test B a convenient origin will
be 145, as the range is from 138 to 151.

        X'          Y'        X'²    Y'²      X'Y'
      +    -      +    -                     +    -
           6      0            36      0     0
      0           6             0     36     0
           8           5       64     25    40
           5           1       25      1     5
      1                7        1     49          7
      5                5       25     25         25
      4                3       16      9         12
           4      5            16     25         20
           2      4             4     16          8
           1      5             1     25          5
    ---------------------------------------------------
     10  -26     20  -21      188    211    45  -77
       -16         -1                         -32

$$D_x = -16/10 = -1.6; \qquad D_y = -1/10 = -0.1,$$

$$\sigma_x = \sqrt{188/10 - 1.6^2} = 4.030,$$

$$\sigma_y = \sqrt{211/10 - 0.1^2} = 4.592,$$

$$r = \frac{-32/10 - (-1.6 \times -0.1)}{4.030 \times 4.592} = -0.182.$$
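Formula (16B) can be sketched in Python with the scores of Example 18. The function name `r_from_origins` is hypothetical. The second call, with both origins set to zero, is included only to illustrate that the choice of origin does not change r.

```python
import math

def r_from_origins(xs, ys, origin_x, origin_y):
    """Coefficient of correlation by formula (16B), using arbitrary origins."""
    n = len(xs)
    dev_x = [x - origin_x for x in xs]   # X'
    dev_y = [y - origin_y for y in ys]   # Y'
    d_x = sum(dev_x) / n                 # D_x
    d_y = sum(dev_y) / n                 # D_y
    sd_x = math.sqrt(sum(v * v for v in dev_x) / n - d_x ** 2)
    sd_y = math.sqrt(sum(v * v for v in dev_y) / n - d_y ** 2)
    product_sum = sum(a * b for a, b in zip(dev_x, dev_y))   # S(X'Y')
    return (product_sum / n - d_x * d_y) / (sd_x * sd_y)

test_a = [19, 25, 17, 20, 26, 30, 29, 21, 23, 24]
test_b = [145, 151, 140, 144, 138, 140, 142, 150, 149, 150]

r = r_from_origins(test_a, test_b, 25, 145)
r_zero_origin = r_from_origins(test_a, test_b, 0, 0)
```

With origins 25 and 145 the intermediate totals are small (S(X') = -16, S(Y') = -1, S(X'Y') = -32), which is the whole advantage of the method when working by hand.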


6.vi. Product-moment correlation when N is large.
If N is large, over 80, the foregoing methods of calculating r
become very laborious and it is usual to curtail the arithmetic
involved by the use of a tabular method. For this purpose we
construct what is known as a correlation table. The
construction of such a table is most easily understood by reference to
an example such as is given in Table IV, which illustrates the
correlation between two tests, X and Y.

In this table each of the test scores has been grouped for
working purposes (see Section 2.iii). For test X the convenient
group unit was 2 and accordingly the X groupings are written
along the top of the table. Test Y had a larger range and the
group unit chosen was 10: the Y groupings are written at the
left-hand side of the table.

We next proceed to make a spot diagram showing the
distribution of the pairs of scores. For example, one subject scored
0 in test X and 55 in test Y; accordingly a dot is put in the
square, or cell, with the X grouping 0, 1 above it and the Y
grouping 50-59 on the left of it. Similarly a dot is made in the
appropriate cell for each other pair of scores. Hence each dot
represents a score both in test X and in test Y and the total
number of dots will be N, the number of subjects who took the
tests. The number of dots in each cell is counted and the
number written in the cell. (If the data are on cards, the cards
may be sorted into their appropriate groups and counted, thus
obviating the necessity of making a spot diagram first.)

The total number of observations in each horizontal row, or
array, is then found and recorded on the right of the table.
These will form a column headed $f_y$, which will give the grouped
frequency distribution of Y. Similarly, at the foot of each
column of the table the total number of observations is
recorded, giving a horizontal row for $f_x$, or the grouped frequency
distribution of X. The totals of this last column and row, i.e.
$S(f_y)$ and $S(f_x)$, should be the same and equal to N.

We then choose an arbitrary origin for each test, call the
corresponding group 0, and then number the groups on both
sides as we did in Section 2.iii. This will give us a column on the
right headed 'y' and a row at the top of the table which will
be 'x'. The means and standard deviations of both X and Y
in working units may now be calculated, using the method of
Section 3.viii. This is most conveniently done at the side of the
table (see Example 19). There is no need to obtain the true
means and standard deviations of the tests; indeed it is
important to keep all the calculations in working units
throughout. Hence we calculate $D_x$ and $D_y$, using the notation of
Section 2.iii, and $\sigma_x$ and $\sigma_y$, using the notation of formula
There now remains the problem of finding the sum of the
xy products. Reference to Table IV will show how this is done.

Starting with the top horizontal array, write in each cell the
product of the frequency in that cell and the x-grouping above
it. For instance, the frequency in the first cell is 1 (written in
the bottom right-hand corner) and the x-grouping for that cell
is -6; hence we write -6 in the top right-hand corner of the
cell. Continue this for each cell containing observations. Then
add these small top right-hand corner numbers for each
horizontal array and record the totals in a column on the right of
the table. Head this column $T_{xy}$, indicating the total of the
x's for each y. As a check on the arithmetic so far, sum this
column, and the total of it, $\Sigma(T_{xy})$, should be equal to $\Sigma(f_x x)$,
as was found in calculating the mean of x.

[TABLE IV. Correlation table for tests X and Y; the table is not legible in the source.]

Finally, we make one further column on the right of the
table by multiplying together the entries in the y and $T_{xy}$
columns: this new column will be headed $yT_{xy}$. Sum this
column to obtain $\Sigma(yT_{xy})$. This gives us the sum of the xy
products. Care should be taken with the signs in all these
calculations.

We have now all the necessary data and the correlation
coefficient may be found by substitution in the modified
formula:

$$r = \frac{\dfrac{\Sigma(yT_{xy})}{N} - D_x D_y}{\sigma_x\sigma_y}. \tag{16C}$$

The whole of the arithmetic involved is shown in the following
example.

Example 19. Calculate the coefficient of correlation between
the tests X and Y in Table IV. (See p. 59 for working.)
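The tabular method of formula (16C) can be sketched in Python. The function name `r_from_table` and the small 3 x 3 frequency table are hypothetical, invented purely for illustration; they are not the data of Table IV. Rows correspond to y groups and columns to x groups, both numbered in working units from an arbitrary origin.

```python
import math

def r_from_table(table, x_units, y_units):
    """Coefficient of correlation from a correlation table, formula (16C).

    table[i][j] is the frequency in the cell with y_units[i] and x_units[j].
    """
    n = sum(sum(row) for row in table)
    f_y = [sum(row) for row in table]          # row totals
    f_x = [sum(col) for col in zip(*table)]    # column totals
    sum_fx_x = sum(f * x for f, x in zip(f_x, x_units))
    sum_fy_y = sum(f * y for f, y in zip(f_y, y_units))
    d_x = sum_fx_x / n
    d_y = sum_fy_y / n
    sd_x = math.sqrt(sum(f * x * x for f, x in zip(f_x, x_units)) / n - d_x ** 2)
    sd_y = math.sqrt(sum(f * y * y for f, y in zip(f_y, y_units)) / n - d_y ** 2)
    # T_xy: total of the x's in each horizontal array.
    t_xy = [sum(f * x for f, x in zip(row, x_units)) for row in table]
    # Arithmetic check described in the text: the T_xy column must sum to the f_x * x total.
    assert sum(t_xy) == sum_fx_x
    sum_xy = sum(y * t for y, t in zip(y_units, t_xy))
    return (sum_xy / n - d_x * d_y) / (sd_x * sd_y)

# Hypothetical table (assumption): 11 observations in a 3 x 3 grid.
table = [[2, 1, 0],
         [1, 3, 1],
         [0, 1, 2]]
units = [-1, 0, 1]

r = r_from_table(table, units, units)
```

Everything stays in working units, as the text advises; no true means or standard deviations are needed because r is unaffected by the choice of origin and unit.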


6.vii. The diagonal summation method. There is an
alternative method, which is preferred by some computers, of
calculating r from a correlation table of grouped data. The
correlation table is constructed as in Example 19 but the small
product figures in the top right-hand corners of the cells are
omitted. As in that example the columns for $f_y$, $y$, $f_y y$ and $f_y y^2$
are written, and also those for $f_x$, $x$, $f_x x$ and $f_x x^2$. From these
are calculated two quantities which we shall denote by A
and B. These are obtained as follows:

$$A = \Sigma(f_y y^2) - \frac{[\Sigma(f_y y)]^2}{N},$$

$$B = \Sigma(f_x x^2) - \frac{[\Sigma(f_x x)]^2}{N}.$$

We now construct a further column by summing the total
frequencies in each diagonal of the table. In order to obtain the
correct sign for r, it is important that these diagonals shall run
from the corner of the table where both variables have low
scores to the corner where both have high scores; in Example
19, for instance, the diagonals will all be parallel to that
running from the top left-hand corner to the bottom right-hand
corner. This column we head $f_d$.

We then proceed as though we were finding the standard
deviation of this column, i.e. an arbitrary origin is chosen and
numbered 0 and the frequencies numbered positively and
negatively on the two sides of this origin, yielding a column
headed d. Two further columns, headed $f_d d$ and $f_d d^2$, are
obtained in the usual way and summed. From these totals we
then obtain a third quantity which we shall denote by C. This
is given by

$$C = \Sigma(f_d d^2) - \frac{[\Sigma(f_d d)]^2}{N}.$$

The coefficient of correlation may then be obtained from
the formula

$$r = \frac{A+B-C}{2\sqrt{AB}}. \tag{16D}$$

(It is left to the student to relate this formula to formula (16).
He should have no difficulty in doing this if he bears in mind
that if X and Y represent deviations of the two variables from
their means, then

$$2S(XY) = S(X^2) + S(Y^2) - S(X-Y)^2;$$

and that (X - Y) is constant for any diagonal.)

The whole of the arithmetic involved is shown in the
following example.

Example 20. Calculate the coefficient of correlation of the
data in Table IV using the method of diagonal summation.

Since in this case the $f_d$ column will have to be written on the
right of the table, it is convenient to write the $f_y$ and associated
columns on the left and the $f_x$ and associated columns above
the table. The setting out of the working is shown on p. 60.

[Working table for Example 20; not legible in the source.]
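The identity behind formula (16D) can be checked numerically with a short Python sketch. This is an illustration on ungrouped pairs, not the grouped working of Example 20: the function names are hypothetical, the data are the marks of Example 17, and d is taken simply as the difference X - Y for each pair, which plays the part of the diagonal number in a correlation table.

```python
import math

def ss_about_mean(values):
    """Sum of squares about the mean: S(v^2) - [S(v)]^2 / N."""
    n = len(values)
    return sum(v * v for v in values) - sum(values) ** 2 / n

def r_by_differences(xs, ys):
    """r = (A + B - C) / (2 * sqrt(A * B)), the form of formula (16D)."""
    a = ss_about_mean(ys)
    b = ss_about_mean(xs)
    c = ss_about_mean([x - y for x, y in zip(xs, ys)])
    return (a + b - c) / (2 * math.sqrt(a * b))

arith = [3, 9, 7, 8, 4, 1, 6, 9, 7, 8, 5, 4, 6, 5, 2, 6, 5, 4, 6, 2]
latin = [1, 8, 4, 10, 6, 5, 5, 3, 8, 7, 2, 6, 5, 9, 4, 5, 7, 1, 3, 5]

r = r_by_differences(arith, latin)   # about 0.353
```

The value agrees with the one found by formula (16A) for the same marks, since both are rearrangements of formula (16).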


6.viii. Significance of product-moment correlation: 
standard error of r. The standard error of r is usually 


calculated from the formula 


Aer 
Se. ofr = mi (17) 


As usual, the probable error is 0-67449 times the standard 
error. 
This formula is approximately true when N is large and the 
values of r are small or moderate in size. In such cases the 
correlation may be taken as differing significantly from zero if 
ris more than twice its standard error, or more than three times 
its probable error. 
In small samples, however, or with very large values of r, the
above formula is not true and the significance of r should be
assessed by the t method. For correlation coefficients

$$t = \frac{r\sqrt{N - 2}}{\sqrt{1 - r^2}}. \qquad (18)$$

It is unnecessary to calculate t for each value of r that is
found. Appendix B gives graphically the criterion of significance
for r for values of N from 50 to 270. The significance of
r when N is 50 or less may be found from the following table,
extracted from R. A. Fisher (Ref. 2), Table V A, or Fisher and
Yates (Ref. 5), Table VI.

TABLE V. Values of r for P = 0.05

n = N − 2

  n      r         n      r
  1    0.997      14    0.497
  2    0.950      15    0.482
  3    0.878      16    0.468
  4    0.811      17    0.456
  5    0.755      18    0.444
  6    0.707      19    0.433
  7    0.666      20    0.423
  8    0.632      25    0.381
  9    0.602      30    0.349
 10    0.576      35    0.325
 11    0.553      40    0.304
 12    0.532      45    0.288
 13    0.514      50    0.273


In using the above table n is 2 less than N, the number of
pairs of observations in the correlation. If the calculated value 
of r is as big as or bigger than the value given in the table for 
the appropriate value of n, the correlation differs significantly 
from zero, i.e. it indicates a real degree of association between 
the two variables. 

The use of these various assessments of significance is 
illustrated in the following example. 


Example 21. The scores in two tests on 100 subjects are
correlated and the value of r obtained is 0.35. Is the correlation
significant?

(a) By formula (17):

$$\text{S.e. of } r = \frac{1 - 0.1225}{\sqrt{100}} = 0.08775.$$

Hence r is significant, since 0.35 is just about 4 times its
standard error.

(b) By formula (18):

$$t = \frac{0.35\sqrt{98}}{\sqrt{1 - 0.1225}} = 3.70.$$

This value of t is much larger than that given in Table III,
p. 37, for n = 30, so that it must be even larger than that for
n = 98; hence r is significantly larger than zero.

(c) By consulting the graph in Appendix B it can be seen
that for N = 100 a value of r of 0.193 is significant. Hence our
present value of 0.35 is definitely significant.
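
The two arithmetical checks of Example 21 can be sketched in a few lines of Python. This is only an illustration of formulas (17) and (18); the function names are invented for the purpose and are not part of the text.

```python
import math

def standard_error_of_r(r, n_pairs):
    """Formula (17): (1 - r^2) / sqrt(N); suited to large N and moderate r."""
    return (1 - r ** 2) / math.sqrt(n_pairs)

def t_for_r(r, n_pairs):
    """Formula (18): t = r * sqrt(N - 2) / sqrt(1 - r^2)."""
    return r * math.sqrt(n_pairs - 2) / math.sqrt(1 - r ** 2)

# Figures of Example 21: r = 0.35 from N = 100 pairs.
se = standard_error_of_r(0.35, 100)
t = t_for_r(0.35, 100)
print(f"s.e. of r = {se:.5f}, r / s.e. = {0.35 / se:.2f}, t = {t:.2f}")
```

The value of t is then referred to the table of t in the usual way; the code does not replace the table.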


6.ix. Significance of the difference between two correlations.
We sometimes wish to know whether the correlation
between two variables is different in two different samples. To
do this we make use of a method devised by Fisher (Ref. 2)
which entails transforming r into a quantity which he calls z.
This is given by the formula

$$z = \tfrac{1}{2}\{\log_e(1 + r) - \log_e(1 - r)\}.$$

Once again there is no need to make the actual calculation as
Table VI gives values of z corresponding to values of r up to
0.500, and vice versa. For values outside the range of this
table the student is referred to Table V B in Fisher (Ref. 2), or
Table VII in Fisher and Yates (Ref. 5).


TABLE VI. Conversion of r into z and z into r

        For z                       For r
      r          add             z          subtract

 0.000-0.114    0.000       0.000-0.114    0.000
 0.115-0.163    0.001       0.115-0.165    0.001
 0.164-0.194    0.002       0.166-0.196    0.002
 0.195-0.216    0.003       0.197-0.220    0.003
 0.217-0.235    0.004       0.221-0.240    0.004
 0.236-0.251    0.005       0.241-0.256    0.005
 0.252-0.265    0.006       0.257-0.271    0.006
 0.266-0.277    0.007       0.272-0.285    0.007
 0.278-0.288    0.008       0.286-0.297    0.008
 0.289-0.299    0.009       0.298-0.309    0.009
 0.300-0.309    0.010       0.310-0.320    0.010
 0.310-0.318    0.011       0.321-0.330    0.011
 0.319-0.327    0.012       0.331-0.339    0.012
 0.328-0.335    0.013       0.340-0.348    0.013
 0.336-0.343    0.014       0.349-0.357    0.014
 0.344-0.350    0.015       0.358-0.365    0.015
 0.351-0.357    0.016       0.366-0.373    0.016
 0.358-0.364    0.017       0.374-0.381    0.017
 0.365-0.371    0.018       0.382-0.389    0.018
 0.372-0.377    0.019       0.390-0.396    0.019
 0.378-0.383    0.020       0.397-0.403    0.020
 0.384-0.388    0.021       0.404-0.409    0.021
 0.389-0.393    0.022       0.410-0.416    0.022
 0.394-0.399    0.023       0.417-0.422    0.023
 0.400-0.404    0.024       0.423-0.428    0.024
 0.405-0.409    0.025       0.429-0.434    0.025
 0.410-0.414    0.026       0.435-0.440    0.026
 0.415-0.419    0.027       0.441-0.446    0.027
 0.420-0.423    0.028       0.447-0.452    0.028
 0.424-0.428    0.029       0.453-0.457    0.029
 0.429-0.432    0.030       0.458-0.463    0.030
 0.433-0.436    0.031       0.464-0.468    0.031
 0.437-0.441    0.032       0.469-0.473    0.032
 0.442-0.445    0.033       0.474-0.478    0.033
 0.446-0.449    0.034       0.479-0.483    0.034
 0.450-0.453    0.035       0.484-0.488    0.035
 0.454-0.456    0.036       0.489-0.493    0.036
 0.457-0.460    0.037       0.494-0.498    0.037
 0.461-0.464    0.038       0.499-0.502    0.038
 0.465-0.467    0.039       0.503-0.507    0.039
 0.468-0.471    0.040       0.508-0.512    0.040
 0.472-0.474    0.041       0.513-0.516    0.041
 0.475-0.478    0.042       0.517-0.520    0.042
 0.479-0.481    0.043       0.521-0.525    0.043
 0.482-0.484    0.044       0.526-0.529    0.044
 0.485-0.488    0.045       0.530-0.533    0.045
 0.489-0.491    0.046       0.534-0.537    0.046
 0.492-0.494    0.047       0.538-0.542    0.047
 0.495-0.497    0.048       0.543-0.546    0.048
 0.498-0.500    0.049       0.547-0.550    0.049


To use this table, look up the value of r in the left-hand column and
add to it the corresponding value in the second column, as z is always
bigger than r. To turn z into r, look up z in the third column and
subtract the corresponding entry in the last column.


For a value of N greater than 3, the standard error of z is
$1/\sqrt{N - 3}$, and the standard error of the difference between
two z's is

$$\sqrt{\frac{1}{N_1 - 3} + \frac{1}{N_2 - 3}},$$

where $N_1$ and $N_2$ are the numbers of pairs in the two samples.
As usual, if the difference between the two z's is greater than
twice its standard error, then the difference is significant.

Example 22. Two groups of children, one of average age 11
and the other of average age 14, are given an intelligence test
and an arithmetic test, and the scores are correlated for each
group separately. The numbers in the groups and the correlation
coefficients were as follows:

11 year olds, $N_1$ = 43, $r_1$ = 0.48,
14 year olds, $N_2$ = 39, $r_2$ = 0.39.

Can the correlation between intelligence and arithmetic be
regarded as different in the two groups?
From Table VI,

$z_1$ = 0.523 and $z_2$ = 0.412.
The difference is 0.111.

$$\text{S.e. of difference} = \sqrt{\tfrac{1}{40} + \tfrac{1}{36}} = 0.230.$$

Hence the difference between the z's is just less than half its
standard error and so cannot be regarded as significant.
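
As a rough check on Example 22, the z transformation and the standard error of a difference may be computed directly instead of being read from Table VI. The sketch below assumes nothing beyond the formulas of this section; the function names are illustrative only.

```python
import math

def fisher_z(r):
    """z = (1/2) * {log_e(1 + r) - log_e(1 - r)}."""
    return 0.5 * (math.log(1 + r) - math.log(1 - r))

def z_difference(r1, n1, r2, n2):
    """Return the difference of the two z's and its standard error."""
    diff = fisher_z(r1) - fisher_z(r2)
    se = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    return diff, se

# Example 22: r = 0.48 from 43 pairs, r = 0.39 from 39 pairs.
diff, se = z_difference(0.48, 43, 0.39, 39)
print(f"difference = {diff:.3f}, s.e. = {se:.3f}, ratio = {diff / se:.2f}")
```

A ratio below 2 corresponds to the conclusion reached above, that the difference is not significant.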


6.x. Mean of several values of r. Use may also be made 
of the z transformation to obtain the average of several values 
of r. Coefficients of correlation should not be regarded as 
ordinary numbers which may be added and divided, and due 
weight should be given to the number of pairs in each corre- 
lation, values calculated from larger samples being more 
important than values from smaller samples. 

To average several values of r, first transform each r into z
and then multiply each z by N − 3, where N is the number of
pairs in the original r. Sum these products and divide the total
by the sum of the (N − 3)'s, giving the mean value of z. Finally
transform this mean z back into r, and this will be the correct
value for the mean of the original correlation coefficients. The
calculation may be conveniently tabulated as in the following
example.
Example 23. The same two variables are correlated in three
different groups. The numbers in the groups and the values of
r are given below. What is the average correlation in the three
groups?

             N      r
Group 1     23    0.41
Group 2     28    0.35
Group 3     35    0.50

Tabulate the work as follows:

   r       z      N − 3    (N − 3)z
 0.41    0.436     20        8.720
 0.35    0.365     25        9.125
 0.50    0.549     32       17.568
                   --       ------
                   77       35.413

$$\text{Mean } z = \frac{35.413}{77} = 0.460.$$

For z = 0.460, r = 0.430 (from Table VI). Hence the
average correlation in the three groups is 0.43.

The significance of an average r may be tested as though it 
had been calculated from S(N —3)+3 pairs of observations. 
In the above example the average r may be tested for significance
as though it had been a single r calculated from 80 pairs.
It is therefore definitely significant, although of its component 
r’s only that in group 3 is significant by itself. 
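
The averaging procedure of Section 6.x can be sketched as follows. The code uses `math.atanh` and `math.tanh` in place of Table VI (they compute the same transformation and its inverse); the function name is invented for illustration.

```python
import math

def mean_correlation(rs, ns):
    """Weighted mean of several r's through z, each z weighted by N - 3.

    Returns the mean r and the equivalent number of pairs, S(N - 3) + 3,
    on which its significance may be judged.
    """
    weights = [n - 3 for n in ns]
    zs = [math.atanh(r) for r in rs]          # r -> z
    mean_z = sum(w * z for w, z in zip(weights, zs)) / sum(weights)
    return math.tanh(mean_z), sum(weights) + 3  # z -> r

# Example 23: three groups of 23, 28 and 35 pairs.
mean_r, equivalent_n = mean_correlation([0.41, 0.35, 0.50], [23, 28, 35])
print(f"mean r = {mean_r:.2f}, equivalent N = {equivalent_n}")
```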


6.xi. Partial correlation. In some cases it seems probable
that two variables X and Y are correlated partly on
account of the fact that each of them is correlated with a third
variable, Z. For instance, it may be that there is a correlation
between scores in an arithmetic examination and scores in a
Latin examination, partly because ability to do both arithmetic
and Latin is correlated with intelligence. In such a case
we may wish to find the correlation between X and Y quite
apart from the influence of Z. This may readily be done by the
method of partial correlation. All we need to know is the three
correlations, viz. $r_{XY}$ between X and Y, $r_{XZ}$ between X and Z
and $r_{YZ}$ between Y and Z. The correlation between X and Y
with the influence of Z removed is then given by the formula

$$r_{XY.Z} = \frac{r_{XY} - r_{XZ}\,r_{YZ}}{\sqrt{1 - r_{XZ}^2}\,\sqrt{1 - r_{YZ}^2}}. \qquad (19)$$


The symbol $r_{XY.Z}$ is read 'the correlation between X and Y
keeping Z constant'.

A table giving the values of $\sqrt{1 - r^2}$ for all values of r to 3
places of decimals may be found in Tables for Statisticians and
Biometricians, Table VIII, p. 20 (Ref. 3).

Example 24. Three tests, A, B and C, were given to a group
of students and the three sets of scores were correlated with
each other, giving the following coefficients:

$r_{AB}$ = 0.66; $r_{AC}$ = 0.60; $r_{BC}$ = 0.40.

What is the correlation between A and B keeping C constant?

From formula (19),

$$r_{AB.C} = \frac{0.66 - 0.60 \times 0.40}{\sqrt{1 - 0.36}\,\sqrt{1 - 0.16}} = \frac{0.42}{0.7328} = 0.57.$$
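
A minimal sketch of formula (19), applied to the three coefficients of Example 24. The function name is illustrative, not part of the text.

```python
import math

def partial_r(r_xy, r_xz, r_yz):
    """Correlation between X and Y keeping Z constant, formula (19)."""
    return (r_xy - r_xz * r_yz) / (
        math.sqrt(1 - r_xz ** 2) * math.sqrt(1 - r_yz ** 2)
    )

# Example 24: r_AB = 0.66, r_AC = 0.60, r_BC = 0.40.
r_ab_c = partial_r(0.66, 0.60, 0.40)
print(f"r_AB.C = {r_ab_c:.2f}")
```

The same function may be applied repeatedly to build up a partial coefficient with two variables held constant, since formula (19A) has the same form with first-order partial coefficients in place of the simple ones.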

The above formula and method may be extended to four or 
more variables. The formula for four variables is given below: 
the student is unlikely to need further extensions. 


$$r_{AB.CD} = \frac{r_{AB.C} - r_{AD.C}\,r_{BD.C}}{\sqrt{1 - r_{AD.C}^2}\,\sqrt{1 - r_{BD.C}^2}}. \qquad (19A)$$

It will be seen that this partial correlation, the correlation
between A and B keeping both C and D constant, requires the
calculation of three other partial correlations, each keeping
only one variable constant.


6.xii. Significance of a partial correlation coefficient.
The significance of a partial correlation coefficient may be
determined by calculating t, which in this case is given by

$$t = \frac{r_p\sqrt{N - p - 2}}{\sqrt{1 - r_p^2}}, \qquad (20)$$

where $r_p$ is the partial coefficient and p is the number of
variables held constant. The number of degrees of freedom for
consulting the table of t (Table III) is n = N − p − 2. Thus in
Example 24, if there were 39 students in the group, we have
$r_p$ = 0.57, p = 1, and N = 39. Hence

$$t = \frac{0.57\sqrt{39 - 1 - 2}}{\sqrt{1 - 0.57^2}} = \frac{0.57 \times 6.0}{\sqrt{1 - 0.57^2}} = 4.17.$$

Reference to Table III with n = 39 − 1 − 2 = 36 shows the
coefficient to be clearly significant.
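
Formula (20) can be sketched in the same way. The function name and argument names below are illustrative; with the figures used in this section the result is close to 4.2 on 36 degrees of freedom.

```python
import math

def t_for_partial_r(r_p, n_pairs, held_constant):
    """Formula (20): t for a partial coefficient with p variables held constant.

    Returns t and the degrees of freedom n = N - p - 2.
    """
    dof = n_pairs - held_constant - 2
    t = r_p * math.sqrt(dof) / math.sqrt(1 - r_p ** 2)
    return t, dof

# Partial coefficient 0.57 from 39 students, one variable held constant.
t, dof = t_for_partial_r(0.57, 39, 1)
print(f"t = {t:.2f} on {dof} degrees of freedom")
```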


EXERCISES ON CHAPTER VI 


20. Calculate the coefficient of correlation by the product-moment 
method between: 


(a) subjects 1—25 in tests C and D, 
(b) subjects 26-50 in tests C and D, 
(c) subjects 51—75 in tests C and D, 
(d) subjects 76—100 in tests C and D. 


Use formula (16A).


21. Check the results of Exercise 20 by using the method of 
Section 6.v and formula (16B). In each case take 25 as arbitrary 
origin for test C and 30 as arbitrary origin for test D. 


22. Calculate the coefficient of correlation between each test from
A to G inclusive and each other one for the whole 100 subjects. 
Employ the method of the correlation table, Section 6.vi, using formula 
(16C) and the grouping adopted in Exercise 3. Repeat using the 
method of diagonal summation and formula (16D). 


23. Using formula (17), calculate the standard error of the values
of r obtained in Exercise 20 for the correlations between test Р and 
the other tests from A to G. Hence determine the significance of these
values of r and check by reference to (a) Table V and (b) the graph in 
Appendix B. 


24. Using the correlation coefficients given in the answer to 
Exercise 20, determine whether r,, is significantly different from 
Tox: Тов» Ter» Tros Tor 8nd 755. Use the method of Section 6.УШ and 
Table VI. 


25. From the correlation coefficients given in the answer to 
Exercise 20, calculate the following partial correlation coefficients: 


(a) between D and F, keeping C constant; 
(b) between C and F, keeping D constant;
(c) between F and G, keeping E constant;
(d) between E and F, keeping G constant; 
(e) between C and Z, keeping B constant. 


Chapter VII
OTHER METHODS OF CORRELATION


7.i. The ranking method. (a) Spearman's coefficient.
It sometimes happens that the actual values of two variables 
cannot be accurately measured, although we are able to rank 
them in order of size or merit. In such a case the method 
of product-moment correlation cannot be applied, but an 
approximate coefficient of correlation may be calculated. If 
N pairs of variables are ranked, X and Y being ranked sepa- 
rately of course, and d represents the difference between the 
ranks of X and Y for any one pair, then the coefficient of ranked
correlation is given by the formula

$$\rho = 1 - \frac{6\Sigma(d^2)}{N(N^2 - 1)}. \qquad (21)$$

This formula may be used whether the distributions of X and
Y are normal or not, but it only gives an approximate indication
of the association between the two variables and should
never be used if it is possible to calculate r. The coefficient 
should not be employed in partial correlation, multiple corre- 
lation, factorial analysis or any other statistical process which 
is based on product-moment correlation. 

The method of ranked correlation is frequently employed 
because the calculation involved is simple for small samples. 
The method of calculation is given here for use in cases where
the data are inadequate for product-moment correlation. 

The first step is to rank each variable, calling the best or
biggest value 1, the second best or biggest 2, and so on. When
two or more values of a variable are the same it is usual to give
each the average rank. For instance, if there are two equal
values for the 6th place, each is ranked as 6½, the next rank
being 8, since these two will have occupied the 6th and 7th
places between them. Similarly, if there are three equal values
for the 10th place, each will be ranked as 11, since they will
occupy the 10th, 11th and 12th places between them, and the
next value will be ranked as 13. By this means, if there are N
pairs of variables, each variable will be ranked from 1 to N.
(This averaging of ranks introduces a further inaccuracy into
formula (21), for the formula assumes that each ranking is
different: it is, however, frequently impossible to avoid it in
using this method.)

The pairs of ranks are written down in two columns and a
third column, headed d, formed by subtracting entries in the
second column from corresponding entries in the first. If the
correct signs are put down in this column, the total of the
column, or Σ(d), should be zero. Each entry in the third
column is then squared, yielding a fourth column headed d².
Finally, this column is summed to obtain Σ(d²), which is then
substituted in formula (21).

In Appendix C are given values of the reciprocal of
6/N(N² − 1) for values of N from 10 to 60. The use of this, as
explained in the Appendix, will save the student a good deal of
arithmetic.

When N is moderately large, the significance of ρ may be
tested by calculating t, which is given in this case by

$$t = \rho\sqrt{\frac{N - 2}{1 - \rho^2}}. \qquad (21A)$$

In referring to Table III, n = N, the number of pairs of ranks.


Example 25. A class of 15 schoolboys was given an intel- 
ligence test and the master provided their order of merit in an 
entrance examination. What is the correlation between the 
test and the entrance examination? Below are the marks and 
order of merit of each boy. 


Boy                       A    B    C    D    E    F    G    H
Order of merit            1    2    3    4    5    6    7    8
Intelligence test score  22   19    6   18   20   16   11    9

Boy                       I    J    K    L    M    N    O
Order of merit            9   10   11   12   13   14   15
Intelligence test score  15   12   10    7   13   12    8

Here one of the variables is ranked for us and may be written
straight down. As regards the marks for the test, boy A scores
most and is given the rank 1; E is next and is ranked 2; B is 3,
and so on, finally giving us the second column.

        Order of       Test
Boy     merit rank     rank       d        d²
 A          1            1        0        0
 B          2            3       −1        1
 C          3           15      −12      144
 D          4            4        0        0
 E          5            2        3        9
 F          6            5        1        1
 G          7           10       −3        9
 H          8           12       −4       16
 I          9            6        3        9
 J         10            8½       1½       2¼
 K         11           11        0        0
 L         12           14       −2        4
 M         13            7        6       36
 N         14            8½       5½      30¼
 O         15           13        2        4
                                 ---     ----
                                  0      265½

$$\rho = 1 - \frac{6 \times 265.5}{15 \times 224} = 1 - 0.474 = 0.526.$$

Examining the significance of this value of ρ we have by
formula (21A),

$$t = 0.526\sqrt{\frac{13}{0.723324}} = 2.23.$$

This is just greater than the critical value of 2.21 given in
Table III for n = 15, hence we may conclude that ρ is significant.
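
The ranking procedure of this section, including the averaging of tied ranks, can be sketched as below and applied to the scores of Example 25. The helper names are invented for illustration, and the code follows formula (21) as it stands, without any further correction for ties.

```python
def average_ranks(values):
    """Rank values with the largest as 1; tied values share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # mean of the places pos+1 .. end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks

def spearman_rho(ranks_x, ranks_y):
    """Formula (21); returns rho together with the sum of d squared."""
    n = len(ranks_x)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(ranks_x, ranks_y))
    return 1 - 6 * sum_d2 / (n * (n * n - 1)), sum_d2

# Example 25: order of merit for boys A to O and their test scores.
merit = list(range(1, 16))
scores = [22, 19, 6, 18, 20, 16, 11, 9, 15, 12, 10, 7, 13, 12, 8]
test_ranks = average_ranks(scores)
rho, sum_d2 = spearman_rho(merit, test_ranks)
print(f"sum of d squared = {sum_d2}, rho = {rho:.3f}")
```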


(b) Kendall's coefficient. Suppose we have n objects
ranked for each of two variables, X and Y. For convenience
these rankings may be written horizontally as below:

      A    B    C    D    E    F    G    H
X     1    2    3    4    5    6    7    8
Y     1    5   11    2    3½   8    6    3½

      I    J    K    L    M    N    O
X     9   10   11   12   13   14   15
Y    13    7   11    9   15   14   11

There are 15 objects ranked here for X and Y; and for simplicity,
though this is not necessary, the X rankings are
arranged in ascending order.

The essence of Kendall's method, which yields a coefficient
of ranked correlation called τ (tau), is to compare each pair of
ranks with each other pair and to allot to each a score of 1, 0
or −1. Let $X_i$ and $X_j$ be the ith and jth rankings for X, where
$X_j$ is to the right of $X_i$. If $X_i$ is smaller than $X_j$ a score of 1 is
allotted to that pair: if they are equal, the score is 0, and if
$X_i$ is the larger, the score is −1. Call the score for this pair $a_{ij}$.
Repeat the process for the Y rankings and call the score for
the $Y_i Y_j$ pair $b_{ij}$. With n objects there will be a total of
$\binom{n}{2}$ pairs. For each pair, multiply $a_{ij}$ and $b_{ij}$ together and
sum the products over the whole $\binom{n}{2}$ pairs to obtain
$\Sigma(a_{ij}b_{ij})$. Then τ, the Kendall coefficient of ranked
correlation, is given by the formula

$$\tau = \frac{\Sigma(a_{ij}b_{ij})}{\tfrac{1}{2}n(n - 1)}. \qquad (22)$$

This is the value of the coefficient when there are no tied
rankings. When there are ties, as in the Y rankings above, the
denominator of formula (22) needs to be modified. There may
be several sets of tied ranks in one row, perhaps 2 ranks being
tied in one set, 3 in another, and so on. If t is the number of
ranks tied in any one set, calculate ½t(t − 1) for that set.
Repeat for all sets of ties in that row and add all the results.
Call the total $U_1$ for the X row and $U_2$ for the Y row. Then the
coefficient, when there are ties, is given by

$$\tau = \frac{\Sigma(a_{ij}b_{ij})}{\sqrt{[\tfrac{1}{2}n(n - 1) - U_1]\,[\tfrac{1}{2}n(n - 1) - U_2]}}. \qquad (22A)$$
Example 26. Calculate τ for the data at the beginning of
this section.
First compare the ranking for A with that for B. In the
X row, 1 is less than 2 so that $a_{ij}$ = 1. In the Y row, 1 is less
than 5, so that $b_{ij}$ is also 1. Hence for the pair AB,

$a_{ij}b_{ij}$ = 1 × 1 = 1.

Next compare A with C, A with D and so on, until A has been
compared with each of the others B to O. Add the 14 values of
$a_{ij}b_{ij}$ thus obtained and record the total. We have now finished
with the A rankings. Next compare the B rankings with each
of the 13 others, C to O, in the same way. Some of these will
be found to be negative, e.g. for BD, $a_{ij}$ = 1 and $b_{ij}$ = −1, so
that $a_{ij}b_{ij}$ = −1. Also, for the pair EH, for example, $a_{ij}$ = 1
and $b_{ij}$ = 0, so that $a_{ij}b_{ij}$ = 0. Listing the various sub-totals
we find the following:

Compared           Sub-total
with remainder     ($a_{ij}b_{ij}$)
      A               14
      B                7
      C               −4
      D               11
      E                9
      F                3
      G                6
      H                7
      I               −2
      J                5
      K                1
      L                3
      M               −2
      N               −1

Add the various sub-totals, which gives us $\Sigma(a_{ij}b_{ij})$, which in
this case is equal to 57.

There are no ties in the X row, so that $U_1$ = 0. In the Y row
there are two sets of ties. E and H are tied: for these t = 2 and
½t(t − 1) = 1. C, K and O are also tied: for these t = 3 and
½t(t − 1) = 3. Hence $U_2$ = 1 + 3 = 4. Substituting in formula
(22A) we find

$$\tau = \frac{57}{\sqrt{(105 - 0)(105 - 4)}} = 0.554.$$

(For the same data, as the student may verify, ρ = 0.717. It
will be found that as a rule the numerical value of τ is less than
that of ρ.)

In the above example the X ranking was written out in
order. This is not essential, indeed actual ranking is not necessary;
for instance, below are the scores of 7 subjects in two
tests, X and Y:

      A    B    C    D    E    F    G
X    45   51   52   47   60   48   61
Y    13   15   12   10   16   13   18

Pairs of scores may be compared and values for $a_{ij}$ and $b_{ij}$
allotted as usual, dependent simply upon whether the second
score in each pair is larger, equal to or smaller than the first.
For the above data the student should find that τ = 0.586.
If now a further subject, H, were tested and obtained scores
of 57 in X and 17 in Y, τ for all 8 subjects could be calculated
by comparing the scores of subjects A to G with those for H
and adding the sub-total for $a_{ij}b_{ij}$ thus obtained to the
$\Sigma(a_{ij}b_{ij})$ for the 7 subjects, and remembering that n in the
denominator of formula (22A) is now 8 instead of 7. The student
may verify that for all 8 subjects, τ = 0.618.

This illustrates a very useful property of Kendall's method
of calculating the coefficient of ranked correlation, viz. that
additional data may be added and τ calculated without having
to re-rank each time, as is necessary in Spearman's method.
If our data were in a time series, therefore, a fresh pair of
readings being made each week, say, we could calculate a
running value for τ and see how it varied with time.
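
The pair-by-pair scoring of Kendall's method, with the allowance for ties of formula (22A), might be sketched as follows. The function names are illustrative. The code works equally on ranks or on raw scores, since only the direction of each comparison matters.

```python
import math
from collections import Counter

def sign(v):
    """Score of 1, 0 or -1 according to the sign of v."""
    return (v > 0) - (v < 0)

def kendall_tau(x, y):
    """Formula (22A); returns tau and the total score S(a_ij * b_ij)."""
    n = len(x)
    score = sum(
        sign(x[j] - x[i]) * sign(y[j] - y[i])
        for i in range(n) for j in range(i + 1, n)
    )

    def tie_total(values):
        # Sum of t(t - 1)/2 over every set of t equal values (U in the text).
        return sum(t * (t - 1) / 2 for t in Counter(values).values())

    pairs = n * (n - 1) / 2
    tau = score / math.sqrt((pairs - tie_total(x)) * (pairs - tie_total(y)))
    return tau, score

# The seven subjects A to G, then the same with subject H added.
x = [45, 51, 52, 47, 60, 48, 61]
y = [13, 15, 12, 10, 16, 13, 18]
tau_7, score_7 = kendall_tau(x, y)
tau_8, score_8 = kendall_tau(x + [57], y + [17])
print(f"7 subjects: tau = {tau_7:.3f}; 8 subjects: tau = {tau_8:.3f}")
```

This sketch recomputes every comparison when a subject is added; the running calculation described above would instead add only the new sub-total to the stored score.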


7.ii. Significance of τ. To test the significance of τ we
first need to calculate the variance of $\Sigma(a_{ij}b_{ij})$, which we will
call var Σ. When there are no tied rankings this is given by

$$\text{var}\,\Sigma = \frac{n(n - 1)(2n + 5)}{18}.* \qquad (23)$$

Take the square root of this, which will give the standard
deviation of Σ, and then if the ratio of $\Sigma(a_{ij}b_{ij}) - 1$ to
$\sqrt{\text{var}\,\Sigma}$ is greater than 1.65, the value of τ is significant.

* See Appendix F for a table to help in the calculation of var Σ.

Example 27. Test the significance of a value of τ of 0.667.
There were 10 pairs of readings and $\Sigma(a_{ij}b_{ij})$ was 30.

From formula (23),

$$\text{var}\,\Sigma = 10 \times 9 \times 25/18 = 125,$$
$$\sqrt{\text{var}\,\Sigma} = 11.18,$$
$$\frac{\Sigma(a_{ij}b_{ij}) - 1}{\sqrt{\text{var}\,\Sigma}} = \frac{30 - 1}{11.18} = 2.59.$$

This is greater than 1.65 so that τ is significant.

When there are tied rankings in either or both rows, the
expression for var Σ is more complicated. Suppose there are
ties to the extent of $t_1$, $t_2$, etc., in the X row; for each set
calculate t(t − 1)(2t + 5) and sum the sets to obtain
Σt(t − 1)(2t + 5). Similarly for the Y row, calling the ties
$u_1$, $u_2$, etc., we obtain Σu(u − 1)(2u + 5). Other adjustments
involving the values of t and u have to be made, so that the
complete expression for var Σ when there are ties present is

$$\text{var}\,\Sigma = \tfrac{1}{18}\{n(n-1)(2n+5) - \Sigma t(t-1)(2t+5) - \Sigma u(u-1)(2u+5)\}$$
$$\qquad + \frac{1}{9n(n-1)(n-2)}\{\Sigma t(t-1)(t-2)\}\{\Sigma u(u-1)(u-2)\}$$
$$\qquad + \frac{1}{2n(n-1)}\{\Sigma t(t-1)\}\{\Sigma u(u-1)\}.* \qquad (23A)$$

Using this the significance of τ may be tested as before.
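
The variance of the score and the significance ratio can be sketched in Python as below. The function takes the sizes of the tied sets in each row; with no ties it reduces to formula (23). The names are illustrative, and the tied expression is written from the standard form of Kendall's result, which is taken here to be what formula (23A) states.

```python
import math

def var_sigma(n, x_ties=(), y_ties=()):
    """Variance of the Kendall score; x_ties and y_ties list the tied-set sizes."""
    def total(sizes, term):
        return sum(term(t) for t in sizes)

    first = (
        n * (n - 1) * (2 * n + 5)
        - total(x_ties, lambda t: t * (t - 1) * (2 * t + 5))
        - total(y_ties, lambda u: u * (u - 1) * (2 * u + 5))
    ) / 18
    second = (
        total(x_ties, lambda t: t * (t - 1) * (t - 2))
        * total(y_ties, lambda u: u * (u - 1) * (u - 2))
        / (9 * n * (n - 1) * (n - 2))
    )
    third = (
        total(x_ties, lambda t: t * (t - 1))
        * total(y_ties, lambda u: u * (u - 1))
        / (2 * n * (n - 1))
    )
    return first + second + third

def significance_ratio(score, variance):
    """Ratio of (score - 1) to the standard deviation of the score."""
    return (score - 1) / math.sqrt(variance)

# Example 27: 10 pairs of readings, score 30, no ties.
variance = var_sigma(10)
ratio = significance_ratio(30, variance)
print(f"var = {variance}, ratio = {ratio:.2f}")
```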
7.iii. Partial rank correlation. Using the Kendall
coefficients of ranked correlation it is possible to calculate
partial correlation coefficients. It happens that the formula
is similar to formula (19) for partial product-moment correlation,
in fact

$$\tau_{AB.C} = \frac{\tau_{AB} - \tau_{AC}\,\tau_{BC}}{\sqrt{(1 - \tau_{AC}^2)(1 - \tau_{BC}^2)}}.$$

As yet no tests for the significance of a partial τ are known.
No similar expression using Spearman coefficients is available.


* See Appendices F and G for tables to help in the calculation.


7.iv. Biserial correlation. Observational data sometimes
make it impossible to calculate either product-moment
or ranked correlation coefficients. For example, a test may be
applied to a group of subjects about whom the only other
information we possess is that each of them has either passed or
failed a particular examination. In such a case, after making
two assumptions, we may calculate a form of correlation
coefficient known as the coefficient of biserial correlation. This
we shall denote bis.r. The two necessary assumptions are:

(1) that the dichotomous variable is normally distributed,
and

(2) that the regression of X on Y is linear (see Chapter VIII).

If both these assumptions are deemed justifiable and if we have
more than 80 observations, so that the data may be grouped,
the bis.r coefficient may be calculated as follows.

Suppose X is the numerical variable, i.e. the test, and Y the
dichotomous variable, i.e. the variable divided into only two
parts. Choose a convenient group unit for X and divide the
observations into the appropriate groups. Then construct a
two-row, or biserial, table showing how the subjects who pass
or fail in Y fall into the X-groups. Such a table might appear
as under:

                              Test X
           −5  −4  −3  −2  −1   0   1   2   3   4   5   6   7   Total
Y  Pass     0   1   0   4   5   9   9  10   6   3   4   2   2     55
   Fail     1   0   2   6   7   9  11   4   2   2   0   1   0     45
   Total    1   1   2  10  12  18  20  14   8   5   4   3   2    100

As in calculating the standard deviation, we choose an arbitrary
origin for X and number off the groups on both sides, as
in Section 2.iii. This has been done in the table above.

Now let us call the portion of the x's which fall into the larger
part of the Y distribution, $x_1$. Then the mean of this row (which
will be the Passes in the above table) will be $\bar{x}_1$. The mean of all
the x's will be $\bar{x}$ and their standard deviation $\sigma_x$. All these
statistics may be calculated, in working units, from the table.
The corresponding statistics for Y cannot be calculated
directly, but assuming that we knew them, the coefficient of
biserial correlation would be given by the formula

$$\text{bis.}r = \frac{\bar{x}_1 - \bar{x}}{\sigma_x} \div \frac{\bar{y}_1 - \bar{y}}{\sigma_y}. \qquad (24)$$

Although we cannot calculate $(\bar{y}_1 - \bar{y})/\sigma_y$ directly from the table,
we may obtain it indirectly (on the above assumptions) from
data given in Table II of the Tables for Statisticians and
Biometricians (Ref. 3). This table gives a quantity ½(1 + α)* and
the values of a function z corresponding to them. (This z is not
to be confused with the z used by Fisher as a transformation
for r, as in Section 6.ix.) The quantity ½(1 + α) in our case is
equal to $n_1/N$, where $n_1$ is the number of observations in the
larger portion of Y and N is the total number of observations
in the whole table. This is readily calculated and the value of
z corresponding to it may be found by interpolation. (The
method of doing this may be best understood from Example 28.)
We then have

$$\frac{\bar{y}_1 - \bar{y}}{\sigma_y} = \frac{zN}{n_1}.$$

The coefficient of biserial correlation is therefore obtained by
calculating $(\bar{x}_1 - \bar{x})/\sigma_x$ from the two-row table and dividing this by
the value of $(\bar{y}_1 - \bar{y})/\sigma_y$ obtained from the statistical tables.

The sign of bis.r has to be determined by inspection. If, for
instance, the mean score of the Passes is greater than that of
the Failures, then the correlation is positive.

This method of correlation should not be resorted to if it is
possible to avoid it. In most cases it gives little more information
than would be acquired by investigating the significance
of the difference between the mean x-scores of the two Y-groups.
In any case, the ordinary standard error of r does not apply to
bis.r, and caution is needed in interpreting the coefficient. (The
standard error of bis.r is known only to a first approximation
and the student who wishes to look further into this matter is
referred to H. E. Soper, Biometrika, vol. x, p. 384, 1914.)

The method of calculation is shown in the example below.

* If a vertical is drawn at any point x in a normal curve, the total
area is divided into two unequal portions, if x deviates from the mean.
½(1 + α) is the area of the greater portion.


Example 28. Calculate the coefficient of biserial correlation
for the two-row table given above.

The total row at the foot of the table gives the grouped
frequency distribution of x, so that $\bar{x}$ and $\sigma_x$ may be calculated
by the method of Section 3.viii.

   f      x      fx      fx²
   1     −5      −5      25
   1     −4      −4      16
   2     −3      −6      18
  10     −2     −20      40
  12     −1     −12      12
  18      0    (−47)      0
  20      1      20      20
  14      2      28      56
   8      3      24      72
   5      4      20      80
   4      5      20     100
   3      6      18     108
   2      7      14      98
 ---            ----    ----
 100             144     645
                 −47
                ----
                  97

$$\bar{x} = \frac{97}{100} = 0.97, \qquad \sigma_x = \sqrt{\frac{645}{100} - 0.97^2} = 2.347.$$

The mean $\bar{x}_1$ is obtained in a similar manner by multiplying
x by the entries in the top row: these we may call $f_1$.

  f₁      x     f₁x
   0     −5       0
   1     −4      −4
   0     −3       0
   4     −2      −8
   5     −1      −5
   9      0    (−17)
   9      1       9
  10      2      20
   6      3      18
   3      4      12
   4      5      20
   2      6      12
   2      7      14
 ---            ----
  55             105
                 −17
                ----
                  88

$$\bar{x}_1 = \frac{\Sigma(f_1 x)}{n_1} = \frac{88}{55} = 1.6.$$

From the biserial table, therefore,

$$\frac{\bar{x}_1 - \bar{x}}{\sigma_x} = \frac{1.6 - 0.97}{2.347} = 0.2684.$$

Now we need to calculate $(\bar{y}_1 - \bar{y})/\sigma_y$ from the tables in Ref. 3.
$n_1$ = 55 and N = 100. Hence

½(1 + α) = 55/100 = 0.55.

From the table we find the following values of z corresponding
to two values of ½(1 + α):

for ½(1 + α) = 0.5477584, z = 0.3960802,
for ½(1 + α) = 0.5517168, z = 0.3955854.

Our value of ½(1 + α) lies between these two. Now the difference
between the two values of ½(1 + α) given above is
0.0039584. This corresponds to a difference in z of 0.0004948.

Our value of ½(1 + α) is 0.55 − 0.5477584 = 0.0022416 above
the smaller of the two values given above. It will be noticed
that as ½(1 + α) gets larger, z gets smaller: hence we have to
subtract from the first z that part of the difference between the
two z's which is proportional to 0.0022416/0.0039584, i.e. the
value of z corresponding to ½(1 + α) = 0.55 is

0.3960802 − 0.0004948 × 0.0022416/0.0039584
= 0.3960802 − 0.0002802
= 0.3958000.*

Hence

$$\frac{\bar{y}_1 - \bar{y}}{\sigma_y} = \frac{z}{n_1/N} = \frac{0.3958}{0.55} = 0.7196.$$

Therefore

$$\text{bis.}r = \frac{\bar{x}_1 - \bar{x}}{\sigma_x} \div \frac{\bar{y}_1 - \bar{y}}{\sigma_y} = \frac{0.2684}{0.7196} = 0.373.$$

This coefficient must be positive since $\bar{x}_1$, the mean test score
of the Passes, is larger than $\bar{x}$, the mean of all the subjects, and
so must be even larger than the mean of the Failures alone.
Hence the Passes have a better score than the Failures and the
relationship between the test scores and the examinational
success is a positive one.

* This method of linear interpolation is not strictly applicable to
the table but is sufficiently approximate for the present purpose.
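
The whole calculation of Example 28 can be sketched in Python. In place of interpolating in the printed tables, the sketch obtains z as the ordinate of the normal curve at the point cutting off an area of n₁/N, using `statistics.NormalDist` from the standard library; this substitution, and the function name, are assumptions made for illustration.

```python
import math
from statistics import NormalDist

def biserial_r(groups, larger_row, totals):
    """Biserial coefficient from a two-row table.

    groups: working-unit values of x; larger_row: frequencies in the larger
    part of Y; totals: frequencies of all observations in each x-group.
    """
    n = sum(totals)
    n1 = sum(larger_row)
    mean_all = sum(f * x for f, x in zip(totals, groups)) / n
    sd_all = math.sqrt(
        sum(f * x * x for f, x in zip(totals, groups)) / n - mean_all ** 2
    )
    mean_1 = sum(f * x for f, x in zip(larger_row, groups)) / n1
    p = n1 / n
    normal = NormalDist()
    z = normal.pdf(normal.inv_cdf(p))  # ordinate where the area is p
    return ((mean_1 - mean_all) / sd_all) / (z / p)

groups = list(range(-5, 8))
passes = [0, 1, 0, 4, 5, 9, 9, 10, 6, 3, 4, 2, 2]
totals = [1, 1, 2, 10, 12, 18, 20, 14, 8, 5, 4, 3, 2]
bis_r = biserial_r(groups, passes, totals)
print(f"bis.r = {bis_r:.3f}")
```

The sign comes out automatically here because the mean of the Passes is compared directly with the general mean.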


7.v. Fourfold correlation. When both variables are 
dichotomous the methods previously described cannot be 
applied. There are various methods of calculating fourfold 
or tetrachoric correlation coefficients, but their use is not 
advised. The association between two dichotomous variables 


is best investigated by the method described in Section 9.iii,
p. 99.


EXERCISES ON CHAPTER VII 


26. (а) Using the Spearman ranking method (formula Em 
calculate the coefficient of ranked correlation between the order 0: 
merit H and tests C and D for the 


(Rank the smallest order of 


ects obtained above, сощ 
orrelation between tests С ап 


D for subjects 1-95, Compare this with the value of r obtained in 


Exercise 20 (a). 


27. Calculate the Kendall coefficient of ranked correlation for the
data of Example 25, p. 70.


28. Calculate the Kendall coefficients of ranked correlation for
the same data as in Exercise 26 above. 


Chapter VIII 
REGRESSION AND THE CORRELATION RATIO 


8.i. Graphic construction of regression lines. When 
we are considering the correlation between two variables, X 
and Y, we may draw a graph showing the mean values of y 
for regularly increasing values of x. This graph will be an 
irregular line which is called the observed regression line of y 
on x. Similarly the observed regression line of x on y is an 
irregular line showing the mean values of x for regularly
increasing values of y. These lines indicate the law of change in 
the mean of one variable for unit change in the other, and if 
the lines are straight the regressions are said to be linear. 

Usually, owing to errors of sampling, the observed regression 
lines are rather irregular, but it may be possible to ‘fit’ straight 
lines to them and to show mathematically that the observed 
regression lines do not depart significantly from the fitted 
straight regression lines. 

A straight line has an algebraic equation which represents it 
in symbols; hence linear regression lines may be represented 
by the following equations: 


(у—9) = rou (1—2), (25) 


(2—2) = r 0-9 (25A) 


The former is the regression straight line of y on x and the latter
the regression straight line of x on y. In these equations r is the
coefficient of correlation between X and Y. In (25) x is called
the independent and y the dependent variable, and vice versa
in (25 A).

The angles these lines make to the horizontal and vertical
respectively are measured by the expressions

$$r\frac{\sigma_y}{\sigma_x} \quad\text{and}\quad r\frac{\sigma_x}{\sigma_y},$$

and these are called the coefficients of regression.


If r = 1, the two regression equations are identical and the
regression straight lines coincide. If r = 0, the two lines are
horizontal and vertical respectively and cross at right angles.
The lines cross for any intermediate value of r, so that the
larger the value of r, the smaller is the acute angle between
them. The two lines always cross at the point $\bar{x}$, $\bar{y}$ on the
graph, i.e. the point indicating where the means of x and y lie.

In any fairly small sample the successive observed means of 
one variable for different values of the other are unlikely to fall 
exactly on a straight line, but if the regression does not depart 
significantly from linearity it is possible to draw a straight line 
which passes very nearly through the observed means. As an 
example, the regression straight lines of the data in Example 
19, p. 58, will be drawn and the discrepancies between the 
observed regression lines and these may be observed. 

At the right-hand side of the correlation table in that example
we find two columns headed f_y and T_x respectively. If
we divide the entries in the second of these by the corresponding
entries in the first, we obtain the mean values of x
corresponding to each y-group. In Example 19 these are as
follows:

     y     f_y    T_x    T_x/f_y
    -5      1     -6     -6.00
    -4      1     -3     -3.00
    -3      2     -6     -3.00
    -2     10    -10     -1.00
    -1     12     -8     -0.67
     0     20    -16     -0.80
     1     22    -10     -0.45
     2     12      3      0.25
     3      8      8      1.00
     4      5      6      1.20
     5      4     11      2.75
     6      3     10      3.33


The last column gives the observed line of regression of x on y.
In like manner the observed line of regression of y on x may be
obtained by calculating a T_y column from the correlation
table and dividing the entries in it by the corresponding entries
in the f_x column. This gives us for the observed regression of
y on x:

     x     f_x    T_y    T_y/f_x
    -6      1     -5     -5.00
    -5*     0      -       -
    -4      3     -6     -2.00
    -3      7     -3     -0.43
    -2      8      2      0.25
    -1     21      0      0.00
     0     30      8      0.27
     1     15     30      2.00
     2      9     29      3.22
     3      3     10      3.33
     4      2     10      5.00
     5      1      6      6.00

* Note that there are no observations in the x = -5 column. There
will therefore be no entry in the last column for this group. Care must
be taken not to record the entry as 0.00, as this would be taken as a
point on the observed regression line.
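
The division of the T columns by the f columns can be sketched in a few lines of Python. This is only an illustration: the names `X_ON_Y`, `Y_ON_X`, `array_means` and `general_mean` are ours, and the figures are the frequencies and totals as read from the two tables above. An empty group is recorded as `None`, never as 0.00, in line with the footnote.

```python
# Each entry is (group value, frequency f, total T of the other variable),
# as read from the two tables of Example 19 above.
X_ON_Y = [(-5, 1, -6), (-4, 1, -3), (-3, 2, -6), (-2, 10, -10), (-1, 12, -8),
          (0, 20, -16), (1, 22, -10), (2, 12, 3), (3, 8, 8), (4, 5, 6),
          (5, 4, 11), (6, 3, 10)]
Y_ON_X = [(-6, 1, -5), (-5, 0, 0), (-4, 3, -6), (-3, 7, -3), (-2, 8, 2),
          (-1, 21, 0), (0, 30, 8), (1, 15, 30), (2, 9, 29), (3, 3, 10),
          (4, 2, 10), (5, 1, 6)]


def array_means(rows):
    """Return {group value: T / f}; an empty group gives None, not 0."""
    return {value: (total / freq if freq > 0 else None)
            for value, freq, total in rows}


def general_mean(rows):
    """Mean of the whole sample: sum of the totals over sum of the frequencies."""
    return sum(total for _, _, total in rows) / sum(freq for _, freq, _ in rows)


observed_x_on_y = array_means(X_ON_Y)   # points marked by crosses in Fig. 4
observed_y_on_x = array_means(Y_ON_X)   # points marked by circles in Fig. 4
mean_x = general_mean(X_ON_Y)           # general mean of x in working units
mean_y = general_mean(Y_ON_X)           # general mean of y in working units
```

The two general means computed this way are the co-ordinates of the point at which the fitted regression lines must cross.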


We now proceed to plot these points on a graph. Since we
are working from arbitrary origins in this example, the two
axes will be at right angles, crossing at the point x = 0, y = 0.
The x line will be horizontal and we mark off along it equal
divisions corresponding to the x-groups. Those to the right of
the origin, or point where the axes cross, will be positive and
those to the left negative, and they are numbered accordingly.
Similarly in the case of y, divisions above the origin are positive
and those below negative.

In Fig. 4 the points on the observed regression lines are
plotted, those for the regression of x on y being marked by
crosses and those for the regression of y on x by circles. The
regression straight lines themselves, AA and BB, are also
drawn. It will be seen that the crosses mostly lie closely about
the line AA and the circles about the line BB. These two
lines cross at the point where x = -0.21 and y = 0.81, which
were the mean values of x and y in working units found in
Example 19. The acute angle between the lines indicates a
fairly high correlation between x and y; it was found in fact
that r = 0.666.
The lines AA and BB are drawn as follows. The regression
coefficients are first calculated from the data r = 0.666,
$\sigma_x$ = 1.818 and $\sigma_y$ = 2.181.

Fig. 4. Regression lines of data in Example 19.

Hence
$$r\frac{\sigma_x}{\sigma_y} = 0.666 \times 1.818/2.181 = 0.5551$$
and
$$r\frac{\sigma_y}{\sigma_x} = 0.666 \times 2.181/1.818 = 0.7990.$$

Substituting these in the regression equations, (25) and (25 A),
we get
$$(y - \bar{y}) = 0.7990(x - \bar{x})$$
and
$$(x - \bar{x}) = 0.5551(y - \bar{y}).$$
Now we know that $\bar{x}$ = -0.21 and $\bar{y}$ = 0.81.
Hence we have
$$y - 0.81 = 0.7990(x + 0.21)$$
and
$$x + 0.21 = 0.5551(y - 0.81),$$
whence
$$y = 0.7990x + 0.9778$$
and
$$x = 0.5551y - 0.6596.$$


By substituting different values of x and y in these equations
and calculating the corresponding values of y and x, the fitted
regression lines may be plotted exactly. Two points are enough
to give each line.

In the first equation, for example,
    if x = -5, y = -3.0172
and if x =  5, y =  4.9728.

These two points fix the line BB.
Similarly the line AA is obtained from the second equation:

    if y = -5, x = -3.4351
and if y =  5, x =  2.1159.
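
As an arithmetical check, the fitting of the two straight lines may be sketched as follows. The function name `regression_lines` and its argument names are our own; the inputs are the values of r, the two standard deviations and the two means quoted above for Example 19. Because the text rounds the coefficients to four places before finding the constant terms, the results agree with the printed figures to about three decimal places only.

```python
def regression_lines(r, sd_x, sd_y, mean_x, mean_y):
    """Return (slope, constant) for y on x and for x on y, from (25) and (25 A)."""
    b_yx = r * sd_y / sd_x            # coefficient of regression of y on x
    b_xy = r * sd_x / sd_y            # coefficient of regression of x on y
    a_yx = mean_y - b_yx * mean_x     # constant term in y = b_yx * x + a_yx
    a_xy = mean_x - b_xy * mean_y     # constant term in x = b_xy * y + a_xy
    return (b_yx, a_yx), (b_xy, a_xy)


(b_yx, a_yx), (b_xy, a_xy) = regression_lines(0.666, 1.818, 2.181, -0.21, 0.81)

# Two points are enough to fix each line.
line_bb = [(x, b_yx * x + a_yx) for x in (-5, 5)]   # y on x
line_aa = [(b_xy * y + a_xy, y) for y in (-5, 5)]   # x on y
```

Both lines pass through the point of the means, which gives a further check on the constants.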


8.iii. Testing the linearity of regressions. Since the
method of product-moment correlation is usually appropriate
only in those cases where the regressions of the two variables
on each other are linear, it is important to be able to ascertain
whether in fact the regressions are linear in any particular
case. Often the graphic method exemplified above suffices,
but in cases of doubt (and always more satisfactorily) the
linearity of regression may be tested mathematically. The
process involves what is called the 'Analysis of Variance'.
(See Chapter XI.)

Essentially the method is as follows. The student will remember
that the variance of x is obtained from the sum of the
squares of the deviations of each individual x from the general
mean, $\bar{x}$. This sum of squares may be split into two portions:

(1) the sum of the squares of the deviations of each x from
the mean of its own array, summed for all arrays, and

(2) the sum of the squares of the deviations of the means of
the arrays from the general mean, $\bar{x}$. In the latter each array
mean must be weighted by the number of observations in it.

This splitting of the variance may be shown symbolically.

Let $f_x$ be the number of observations in a y-array corresponding
to a particular value of x, and $\bar{y}_x$ be the mean of the y's in this
array. It may then be shown that

$$S(y - \bar{y})^2 = S(y - \bar{y}_x)^2 + \Sigma[f_x(\bar{y}_x - \bar{y})^2],$$

where $\Sigma$ means a summation for all different values of x.

Now if there is a linear regression of y on x, there will be a
mean y for each array which may be calculated from the regression
equation. The quantity $\Sigma[f_x(\bar{y}_x - \bar{y})^2]$ may therefore
be subdivided into two portions, one of which is the sum of
squares due to linear regression and the other the weighted
sum of the squares of the deviations of the array means from
the regression straight line. Three quantities are therefore required:

(1) $\Sigma[f_x(\bar{y}_x - \bar{y})^2]$.
(2) $S(y - \bar{y})^2$.
(3) $\dfrac{[S(x - \bar{x})(y - \bar{y})]^2}{S(x - \bar{x})^2}$.

These can be calculated as follows, $T_y$ being the total of the y's
in each y-array:

(1) $$\Sigma[f_x(\bar{y}_x - \bar{y})^2] = \Sigma\left(\frac{T_y^2}{f_x}\right) - \frac{[\Sigma(f_y y)]^2}{N}.$$

(2) $$S(y - \bar{y})^2 = \Sigma(f_y y^2) - \frac{[\Sigma(f_y y)]^2}{N}.$$

(3) $$\frac{[S(x - \bar{x})(y - \bar{y})]^2}{S(x - \bar{x})^2} = \frac{\left[\Sigma(x T_y) - \dfrac{\Sigma(f_x x)\,\Sigma(f_y y)}{N}\right]^2}{\Sigma(f_x x^2) - \dfrac{[\Sigma(f_x x)]^2}{N}}.$$

From these data we construct a table. In computing the
number of degrees of freedom in each part of the variance, a
is the number of y-arrays.

                                                     Degrees of    Mean
                                                     freedom       square

A. Total sum of squares:                               N - 1
   $S(y - \bar{y})^2$

B. Total sum of squares within arrays:                 N - a       (A - C)/(N - a)
   (A - C)

C. Total sum of squares between arrays:                a - 1
   $\Sigma[f_x(\bar{y}_x - \bar{y})^2]$

D. Sum of squares due to deviations from               a - 2       (C - E)/(a - 2)
   linear regression: (C - E)

E. Sum of squares due to linear regression:            1
   $[S(x - \bar{x})(y - \bar{y})]^2 / S(x - \bar{x})^2$

In this table entries B and D are obtained by subtraction as
indicated, B by subtracting C from A, and D by subtracting
E from C. The entries in B and D are then divided by their
respective degrees of freedom, giving the two entries in the
column headed 'Mean square'.

*We next find the logarithms of the entries in the mean
square column and multiply the difference between these
logarithms by 1.1513 (to convert logarithms to the base 10 to
Napierian logarithms and to divide by 2). This gives us a
quantity called z, which is tabulated in Table VI of Fisher's
book. (Unfortunately this is the third z we have had to use and
should not be confused with those of Sections 6.ix and 7.iv.)

$$z = 1.1513 \log \frac{(C - E)/(a - 2)}{(A - C)/(N - a)}.$$

* See next section.

It finally remains to look up the value for z in Fisher's table
corresponding to the number of degrees of freedom we have.
There are two values of n required, $n_1$ and $n_2$. The $n_1$ of the
table is the number of degrees of freedom corresponding to
the larger mean square, i.e. whichever of the entries B or D is
the bigger, and $n_2$ is the number of degrees of freedom corresponding
to the smaller mean square.

If the calculated z is smaller than the z in Fisher's 5 % point
table, then the regression of y on x does not differ significantly
from linearity.

The whole of the necessary calculation is shown in Example
29. It must be remembered that this is an examination of
the regression of y on x only: the regression of x on y should
also be examined by an exactly similar method, interchanging
x and y in each formula.


Example 29. Examine the linearity of the regression of y on
x for the correlation table given in Example 19.

For the sake of space the correlation table itself is not reproduced
here. The first ten columns to the right of the table
are copied from Example 19, p. 58, and were obtained as
explained in the section preceding that example.

From the calculations on p. 89 we may construct the
analysis table. The number of arrays, a, = 12.

                                                          Degrees of   Mean
                                                          freedom      square

A. Total sum of squares                        475.39        99
B. Total sum of squares within arrays          227.3033      88        2.5830
C. Total sum of squares between arrays         248.0867      11
D. Sum of squares due to deviations             37.2467      10        3.7247
   from linear regression
E. Sum of squares due to linear                210.84         1
   regression

z = 1.1513 (log 3.7247 - log 2.5830) = 0.1830.


We must now consult Fisher's 5 % point table of z. We have
$n_1$ = 10 and $n_2$ = 88.

Now we see from the table that when $n_1$ = 12 and $n_2$ = 120,
the 5 % point is 0.3032, hence the value for $n_1$ = 10 and $n_2$ = 88
must be bigger than this. Accordingly our calculated value
of z is much smaller than the 5 % point and we conclude that
the regression of y on x does not depart significantly from
linearity.

The calculations of p. 89, from which the analysis table was
constructed, give:

$$\Sigma[f_x(\bar{y}_x - \bar{y})^2] = 313.6967 - 65.61 = 248.0867,$$

$$S(y - \bar{y})^2 = 541 - 65.61 = 475.39,$$

$$\frac{[S(x - \bar{x})(y - \bar{y})]^2}{S(x - \bar{x})^2} = 210.84.$$
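
The arithmetic of Example 29 may be checked with the following sketch. It is an illustration only: the names `ARRAYS`, `between_arrays_ss` and `linearity_test` are ours, the array frequencies and totals are those of the table of the observed regression of y on x, and the total sum of squares (475.39) and the sum of squares due to linear regression (210.84) are taken as given in the analysis table. The quantity z is computed as half the Napierian logarithm of the ratio of the two mean squares, which is the same thing as multiplying the difference of the common logarithms by 1.1513.

```python
import math

# (f_x, T_y) for each of the 12 y-arrays of Example 19:
# number of observations in the array and total of the y's in it.
ARRAYS = [(1, -5), (0, 0), (3, -6), (7, -3), (8, 2), (21, 0),
          (30, 8), (15, 30), (9, 29), (3, 10), (2, 10), (1, 6)]


def between_arrays_ss(arrays):
    """Sum of squares between arrays: sum(T^2 / f) - (sum T)^2 / N."""
    n_obs = sum(f for f, _ in arrays)
    grand_total = sum(t for _, t in arrays)
    return (sum(t * t / f for f, t in arrays if f > 0)
            - grand_total ** 2 / n_obs)


def linearity_test(total_ss, between_ss, regression_ss, n_obs, n_arrays):
    """Return (mean square D, mean square B, variance ratio D/B, z)."""
    within_ss = total_ss - between_ss            # entry B = A - C
    deviation_ss = between_ss - regression_ss    # entry D = C - E
    ms_within = within_ss / (n_obs - n_arrays)
    ms_deviation = deviation_ss / (n_arrays - 2)
    ratio = ms_deviation / ms_within
    z = 0.5 * math.log(ratio)                    # = 1.1513 * log10(ratio)
    return ms_deviation, ms_within, ratio, z


between = between_arrays_ss(ARRAYS)
ms_dev, ms_within, ratio, z = linearity_test(475.39, between, 210.84, 100, 12)
```

Here D is the larger mean square, so the ratio exceeds 1 and z is positive; if B were the larger, the two would be interchanged before consulting the table, as explained above.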


... case is the entry in the 'Mean square' column in the D row
di... Example 29, for instance, the variance
ratio is 3.7247/2.5830 = 1.44. In this case $n_1$ = 10 and $n_2$ = 88.
In Table V of Fisher and Yates for ... find an entry of 1.83
for the ... % point ratio ... the regression
does not depart significantly from ..., which ...
our findings in the previous section.
8.v. The correlation ratio. If the regressions of the two
variables on each other are non-linear, the degree of association
between the variables cannot be measured by ordinary
correlation. There might be a real relationship and yet the
coefficient of correlation would be zero if the regressions were
semicircular and symmetrical, for example. In such cases a
measure of association may be obtained by calculating the
correlation ratio.

There are two correlation ratios for each pair of variables
and they are denoted by $\eta_{yx}$ and $\eta_{xy}$. The formulae for these
are simple:

$$\eta_{yx} = \frac{\sigma_{\bar{y}_x}}{\sigma_y} = \sqrt{\frac{\Sigma[f_x(\bar{y}_x - \bar{y})^2]}{S(y - \bar{y})^2}}, \tag{26}$$

$$\eta_{xy} = \frac{\sigma_{\bar{x}_y}}{\sigma_x} = \sqrt{\frac{\Sigma[f_y(\bar{x}_y - \bar{x})^2]}{S(x - \bar{x})^2}}. \tag{26 A}$$

In these formulae $\sigma_{\bar{x}_y}$ signifies the standard deviation of the
means of the x-arrays and $\sigma_{\bar{y}_x}$ the standard deviation of the
means of the y-arrays, so that $\eta$ may be seen to be the ratio
between the standard deviation of the means of arrays and the
standard deviation of the whole sample.
It will be seen that the requisite data for the calculation of
$\eta_{yx}$ have already been obtained in Example 29, and if the
student has examined the linearity of the regression of x on
y he will also have the data for calculating $\eta_{xy}$.

Example 30. Calculate $\eta_{yx}$ from the data obtained in
Example 29.

From that example we see that
$$\Sigma[f_x(\bar{y}_x - \bar{y})^2] = 248.0867 \quad\text{and}\quad S(y - \bar{y})^2 = 475.39.$$

Hence
$$\eta_{yx}^2 = \frac{248.0867}{475.39} = 0.52186,$$
and
$$\eta_{yx} = \sqrt{0.52186} = 0.7224.$$
8.vi. The data necessary for the examination of the
linearity of regression may now be expressed in an alternative
form. We have

$$S(y - \bar{y})^2 = N\sigma_y^2,$$

$$\Sigma[f_x(\bar{y}_x - \bar{y})^2] = N\eta_{yx}^2\sigma_y^2,$$

$$\frac{[S(x - \bar{x})(y - \bar{y})]^2}{S(x - \bar{x})^2} = Nr^2\sigma_y^2.$$

Also the z of Section 8.iii may be expressed as

$$z = 1.1513 \log \frac{(\eta^2 - r^2)(N - a)}{(1 - \eta^2)(a - 2)}.$$

If we substitute in this expression the values we have obtained,
we find that

$$z = 1.1513 \log \frac{6.89075}{4.7814} = 1.1513 \log (1.4412) = 0.183.$$

This result is identical with the value of z calculated in
Example 29.
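
The correlation ratio and the alternative expression for z may be checked together in a short sketch. The function names `correlation_ratio` and `z_from_eta` are ours; the inputs are the sums of squares of Example 29 and r = 0.666.

```python
import math


def correlation_ratio(between_ss, total_ss):
    """eta = sqrt(sum of squares between arrays / total sum of squares)."""
    return math.sqrt(between_ss / total_ss)


def z_from_eta(eta_squared, r, n_obs, n_arrays):
    """z of Section 8.iii expressed in terms of eta squared and r squared."""
    numerator = (eta_squared - r ** 2) * (n_obs - n_arrays)
    denominator = (1 - eta_squared) * (n_arrays - 2)
    return 0.5 * math.log(numerator / denominator)   # = 1.1513 * log10(ratio)


eta = correlation_ratio(248.0867, 475.39)
z = z_from_eta(eta ** 2, 0.666, 100, 12)
```

The value of z found in this way differs from that of Example 29 only in the fourth decimal place, the difference arising from the rounding of r.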


EXERCISE ON CHAPTER VIII 


... Exercise 22, examine the linearity
... other.


Chapter IX 
$\chi^2$: CONTINGENCY: GOODNESS OF FIT


9.i. The statistical methods already described have mostly
been applicable to quantitative numerical data only, and 
usually only to data which are at any rate approximately 
normally distributed. It may happen, however, that the 
available data are qualitative or quantitative only in the sense 
that we know the number of cases falling into different cate- 
gories. In such instances the methods which have been 
explained in the previous chapters cannot be used but there 
are other methods which are appropriate.

These methods depend chiefly on a statistic known as $\chi^2$.
The mathematical derivation of this statistic is difficult and
cannot be described here. However, the distribution of it has
been worked out and tables are available showing the frequency
with which different values of $\chi^2$ are exceeded and
also the values of $\chi^2$ corresponding to particular frequencies.
Reference to the appropriate tables will be made later in
this chapter.

Use may be made of $\chi^2$ in the investigation of a number of
different problems, but the calculation of it is essentially the
same in each case. If O is the observed frequency in a particular
category into which a variable may fall and E is the frequency
which would be expected to fall in that category on some
hypothesis, then $\chi^2$ may be found by dividing the square of the
difference between O and E by E and summing these quotients
for all categories into which the variable falls. In symbols,

$$\chi^2 = S\left[\frac{(O - E)^2}{E}\right]. \tag{27}$$

This is the general formula and the calculation of it in
specific cases will be described in due course.

Having found $\chi^2$ we need to know the number of degrees of
freedom available for calculating it in each particular case
before we can make use of the $\chi^2$ tables. Rules for finding the
number of degrees of freedom will be given in each instance.
Consultation of the tables will then indicate the probability,
P, of a calculated value of $\chi^2$ being exceeded as a result of
random sampling. If this probability is less than 0.05 (i.e.
19:1), then $\chi^2$ may be regarded as showing that the observed
data depart significantly from the hypothesis which is being
examined. This hypothesis may be that a variable has a particular
type of distribution or, more frequently, that there is
no association between two variables. This latter hypothesis,
known as a null hypothesis, assumes that two variables are not
associated: if we can disprove a null hypothesis, it follows that
the two variables must be associated. Such a procedure is
usual in the investigation of certain problems of association
which will now be described.


9.ii. The contingency table. An example of a contingency table
is shown below, with 5 x-categories and 4 y-categories.

            x_1      x_2      x_3      x_4      x_5
    y_1                                                  n_y1
    y_2                                                  n_y2
    y_3                                                  n_y3
    y_4                                                  n_y4
            n_x1     n_x2     n_x3     n_x4     n_x5     N

It will be seen that the table is divided into 5 × 4 = 20 rectangles
or cells. The total frequency in each x-category is given
at the foot of the columns and will be $n_{x_1}$, $n_{x_2}$, etc. Similarly
the total frequency in each y-category is given at the end of the
horizontal rows and will be $n_{y_1}$, $n_{y_2}$, etc. The total frequency of
each variable, or pairs of variables, will as usual be N. In
general, if there are r rows and c columns, there will be r × c
cells. Each cell may be referred to by the x and y categories
into which it falls: e.g. the cell on the third row down and in the
second column from the left may be referred to as the $x_2$, $y_3$
cell.


The contingency table is completed by filling in the observed
frequency in each cell. For instance, we count how many
observations in the $x_1$ category are also in the $y_1$ category and
write down the number in the $x_1$, $y_1$ cell, and so on. From the
completed table we may calculate a coefficient known as the
coefficient of mean square contingency and denoted by C. To do
this we have first to calculate the frequency which would be
expected in each cell on the null hypothesis, i.e. on the assumption
that the two variables are not associated with one another.
This is done quite simply as follows: if $n_{x_c}$ is the total frequency
in the $x_c$ column, and $n_{y_r}$ the total frequency in the $y_r$ row, then
the frequency that would be expected in the $x_c$, $y_r$ cell on the
null hypothesis is

$$\frac{n_{x_c} \times n_{y_r}}{N}.$$

By substituting each of the values of c and r in turn, we obtain
the expected frequency in every cell.

The next step is to subtract the expected frequency from the
observed frequency in each cell: the resulting values of (O - E)
are written in each cell with due regard to sign, and the arithmetic
may be checked at this stage by observing that the total
of (O - E) is zero for each row and each column.

Next square (O - E) and divide the square by E for each cell.
This gives a series of quotients which is conveniently written
down on the right of the table: there will be r × c such quotients.
Finally this column of figures is added, giving $\chi^2$.

From this the coefficient of contingency, C, is given by the
formula

$$C = \sqrt{\frac{\chi^2}{N + \chi^2}}. \tag{28}$$


[Sometimes an intermediate step is inserted in the definition
of C. $\chi^2$ is called the 'square contingency' and if we divide it
by N we get the 'mean square contingency', which is denoted
by $\phi^2$. Then

$$C = \sqrt{\frac{\phi^2}{1 + \phi^2}}. \tag{28 A}]$$

If the null hypothesis is correct, O and E will be equal for
each cell (apart from errors of sampling), so that $\chi^2$ will be zero
and C will also be zero. It is evident from formula (28) that C
... maximum value of C ... mentioned. Moreover, C is
... of the association has to be ...

... freedom available for calculating $\chi^2$. In the case of a
table of r rows and c columns this is given by n = (r - 1)(c - 1).
We may then consult Fisher's table, of which an extract is
given below, noting the value of $\chi^2$ appearing against the value
of n in the column headed P = 0.05. If the calculated value of
$\chi^2$ is greater than that given in the table, then it is significant,
and the null hypothesis is disproved, i.e. there is a significant
association between the variables.
An extract from Fisher's table (Ref. 2, Table III) or Fisher
and Yates's (Ref. 5, Table IV) for P = 0.05 is given below.

TABLE VII. Table of $\chi^2$ for different values of n

                          P = 0.05
     n     χ²          n     χ²          n     χ²
     1    3.841       11   19.675       21   32.671
     2    5.991       12   21.026       22   33.924
     3    7.815       13   22.362       23   35.172
     4    9.488       14   23.685       24   36.415
     5   11.070       15   24.996       25   37.652
     6   12.592       16   26.296       26   38.885
     7   14.067       17   27.587       27   40.113
     8   15.507       18   28.869       28   41.337
     9   16.919       19   30.144       29   42.557
    10   18.307       20   31.410       30   43.773

If n is greater than 30, significance can be tested by calculating
$\sqrt{2\chi^2} - \sqrt{2n - 1}$. If this exceeds 1.65, $\chi^2$ is significant.


(2) To make use of Elderton's table, which is given in
Tables for Statisticians and Biometricians (Ref. 3, Table XII),
we need to know n', which is one more than the number of
degrees of freedom, i.e. n' = (r - 1)(c - 1) + 1. Different values
of n' are given at the heads of columns in Elderton's table and
integral values of $\chi^2$ are listed at the side. By looking up the
value of P in the appropriate n' column opposite the calculated
value of $\chi^2$, we obtain the probability of as great a value of $\chi^2$
or greater occurring as a result of random sampling; if this
value of P is less than 0.05, $\chi^2$ is significant. Since the listed
values of $\chi^2$ are integral, it is usually necessary to interpolate
to find the value of P.

The use of both tables is illustrated in Example 31.

In using the method of contingency there are two provisions
to be borne in mind. First, E, the expected frequency, must
be at least 5 in each cell; secondly, the table should if possible
contain at least 5 rows and 5 columns. The former provision
is the more important.

Example 31. Two examiners assessed the intelligence of 200
students, one by a verbal test and the other by a performance
test. Each graded the intelligence as Very Good, Good, Fair
or Poor. The relation between the two sets of judgments is
shown in the contingency table below. Calculate the coefficient
of contingency and examine whether the relationship between
the two judgments can be regarded as significant or not.

                              Examiner 1
                       V.G.     G.      F.      P.
                V.G.    19      10       8       3
                G.       7      40       9       4
    Examiner 2  F.       8      20      23      19
                P.       0       8      12      10

First we construct a contingency table, leaving room for
three entries in each cell. Then in each cell we enter the observed
frequency, the expected frequency and the difference
between the two, thus:

    O - E
    O
    E

              V.G.      G.       F.       P.      Total
             12.2     -5.6     -2.4     -4.2        0
    V.G.     19       10        8        3         40
              6.8     15.6     10.4      7.2       40

             -3.2     16.6     -6.6     -6.8        0
    G.        7       40        9        4         60
             10.2     23.4     15.6     10.8       60

             -3.9     -7.3      4.8      6.4        0
    F.        8       20       23       19         70
             11.9     27.3     18.2     12.6       70

             -5.1     -3.7      4.2      4.6        0
    P.        0        8       12       10         30
              5.1     11.7      7.8      5.4       30

              0        0        0        0
    Total    34       78       52       36        200
             34       78       52       36        200

    (O - E)²/E, cell by cell:
    V.G.     21.888    2.010    0.554    2.450
    G.        1.004   11.776    2.792    4.281
    F.        1.278    1.952    1.266    3.251
    P.        5.100    1.170    2.262    3.919

    χ² = 66.952

$$C = \sqrt{\frac{\chi^2}{N + \chi^2}} = \sqrt{\frac{66.952}{266.952}} = 0.501.$$
Along the bottom and at the side of the table the arithmetic
is checked by showing that the total of the expected frequencies
for each row and each column is the same as the total
of the observed frequencies, and also that the sum of (O - E)
is zero for each row and each column. Then for each cell we
calculate (O - E)²/E and tabulate the values obtained on the
right of the table. The sum of these is $\chi^2$. The remainder of the
calculation of C is shown below the table.

We shall assess the significance of this result by both
methods.

(1) Using Fisher's table:

    n = (4 - 1)(4 - 1) = 9.

From Table VII, for n = 9, $\chi^2$ = 16.919. The calculated value
of $\chi^2$ is much bigger than the value in the table, hence the distribution
departs significantly from independence, i.e. there
is a significant relation between the two sets of judgments.

(2) Using Elderton's table:

    n' = (4 - 1)(4 - 1) + 1 = 10.

For n' = 10, opposite $\chi^2$ = 60 we get a value of P = 0.000000.
This means that a value of $\chi^2$ as big or bigger than the one
calculated would arise as a result of random sampling less than
once in a million times. Accordingly there is no doubt about
the significance of the relationship between the two sets of
judgments.

Note that Fisher's method is the easier for proving whether
or not there is a significant relationship in a single table. If
we wish to compare the significance in two or more tables we
need to know the value of P for each; in this case we have to
use Elderton's table or else the complete $\chi^2$ tables given in
Fisher's book.
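
The whole calculation of Example 31 may be sketched in Python as a check on the arithmetic. The function name `chi_square_contingency` is ours, and the table is entered row by row as given in the example (rows for Examiner 2, columns for Examiner 1). The expected frequency in each cell is the product of its row total and column total divided by N, as described in Section 9.ii.

```python
import math


def chi_square_contingency(table):
    """Return (chi-square, degrees of freedom, C) for an r x c table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n_obs = sum(row_totals)
    chi_sq = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n_obs
            chi_sq += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)       # n = (r - 1)(c - 1)
    contingency = math.sqrt(chi_sq / (n_obs + chi_sq))  # formula (28)
    return chi_sq, dof, contingency


EXAMPLE_31 = [[19, 10, 8, 3],
              [7, 40, 9, 4],
              [8, 20, 23, 19],
              [0, 8, 12, 10]]

chi_sq, dof, contingency = chi_square_contingency(EXAMPLE_31)
significant = chi_sq > 16.919   # 5 % point of Table VII for n = 9
```

The sketch does not test the provision that every expected frequency should be at least 5; this must still be verified by inspection of the expected frequencies.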


9.iii. 2 × 2 tables. A special form of contingency arises
when both variables are dichotomous, i.e. each variable can be
divided into only two classes. The association between such
variables may be shown by constructing a contingency table
with only four cells, since there will be only two columns and
two rows. We shall now examine the significance of a fourfold
or 2 × 2 table making use of the $\chi^2$ method.
In order to simplify the notation we shall denote the observed
frequencies in the four cells by a, b, c and d as under:

              x_1       x_2
    y_1        a         b        a + b
    y_2        c         d        c + d
             a + c     b + d        N

It will be seen that N = a + b + c + d.

The value of $\chi^2$ may be determined in the same way as that
used for any contingency table, but the arithmetic may be
shortened by using the formula

$$\chi^2 = \frac{(ad - bc)^2 N}{(a + c)(b + d)(c + d)(a + b)}. \tag{29}$$

The denominator of this will be seen to be the product of the
totals of the rows and columns.

Having found $\chi^2$, its significance may be determined by consulting
Fisher's table for n = 1. It will be seen that if the
calculated value of $\chi^2$ is as great or greater than 3.841, then
there is a significant association between the two variables.

Alternatively, the value of P may be found by consulting a
special table given by Yule (Ref. 1, p. 534). Reference to this
table will show that for a value of $\chi^2$ equal to 3.84, P = 0.05.
This, of course, agrees with Fisher's table, but if exact values
of P are required for purposes of comparison they may be
found from Yule's table.

As in the case of all contingency tables, the nature of the
association in a 2 × 2 table has to be determined by inspection.


Example 32. Is there a significant association between X
and Y in the following data?

                       X
                x_1     x_2     Total
         y_1     65      25       90
    Y
         y_2     20      50       70

       Total     85      75      160

From formula (29)

$$\chi^2 = \frac{(65 \times 50 - 25 \times 20)^2 \times 160}{85 \times 75 \times 70 \times 90}.$$

This may readily be evaluated by logarithms and we find that
$\chi^2$ = 30.13.

This value is much greater than 3.84, the critical significance
value of $\chi^2$ for one degree of freedom, so that we may conclude
that there is a significant relationship between X and Y.


9.iv. Yates's correction. The $\chi^2$ distribution is a continuous
one but in 2 × 2 tables the number of sets of observations
which can fit the marginal totals is finite and limited,
so that the actual distribution from such a table is definitely
discontinuous. Allowance for this fact may be made by
applying Yates's correction for continuity. Essentially this
consists of decreasing by ½ those cell values which are greater
than expectation and increasing by ½ those which are less
than expectation. This has the effect of slightly decreasing
the value of $\chi^2$.

Applying this correction, formula (29) becomes

$$\chi^2 = \frac{(|ad - bc| - \tfrac{1}{2}N)^2 N}{(a + c)(b + d)(c + d)(a + b)}. \tag{30}$$

From the data in the previous Example the student may
verify that the value of $\chi^2$ applying Yates's correction is 28.4.
With such a large value the correction has not altered significance,
but with borderline cases, where $\chi^2$ is only just significant
as calculated by formula (29), the correction should
always be applied since it may very easily bring the value below
the significance level.


9.v. Alternative method of calculation. An alternative
method of calculating $\chi^2$ from a 2 × 2 table, which may be used
as an arithmetical check on results obtained from formula (29),
is as follows:

Let $p_1$ be the proportion of $y_1$ which is in the $x_1$ class;
    $p_2$ be the proportion of $y_2$ which is in the $x_1$ class;
    p be the proportion of the total population in the $x_1$
      class; and
    q = 1 - p.

Then
$$p_1 = \frac{a}{a + b}, \quad p_2 = \frac{c}{c + d}, \quad p = \frac{a + c}{N}.$$

$\chi^2$ is then given by the formula

$$\chi^2 = \frac{p_1 a + p_2 c - p(a + c)}{pq}. \tag{31}$$

The algebraic proof of the identity of this expression with that
given in formula (29) is quite simple and is left to the student
as an exercise. Note that Yates's correction cannot be applied
with this method of calculation.


Applying this formula to the data of Example 32 we have:
$p_1$ = 65/90, $p_2$ = 20/70, p = 85/160 and q = 75/160.

Hence
$$\chi^2 = \frac{\dfrac{65 \times 65}{90} + \dfrac{20 \times 20}{70} - \dfrac{85 \times 85}{160}}{\dfrac{85}{160} \times \dfrac{75}{160}}
= \frac{46.944 + 5.714 - 45.156}{0.249} = 30.13.$$

This result can be seen to be identical with that obtained in
Example 32.
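
The three ways of treating a 2 × 2 table (formula (29), Yates's correction, and the method of proportions) may be compared in a short sketch. The function names are ours. In the corrected form the difference |ad - bc| is reduced by N/2 and is not allowed to fall below zero; this last safeguard is an assumption of the sketch and is not stated in the text.

```python
def chi_square_2x2(a, b, c, d, yates=False):
    """Chi-square for a fourfold table by formula (29), or (30) if yates=True."""
    n_obs = a + b + c + d
    difference = abs(a * d - b * c)
    if yates:
        # Yates's correction: reduce |ad - bc| by N/2 (never below zero).
        difference = max(difference - n_obs / 2, 0)
    return difference ** 2 * n_obs / ((a + c) * (b + d) * (c + d) * (a + b))


def chi_square_2x2_by_proportions(a, b, c, d):
    """Chi-square for a fourfold table by formula (31)."""
    n_obs = a + b + c + d
    p1 = a / (a + b)      # proportion of y1 in the x1 class
    p2 = c / (c + d)      # proportion of y2 in the x1 class
    p = (a + c) / n_obs   # proportion of the whole sample in the x1 class
    q = 1 - p
    return (p1 * a + p2 * c - p * (a + c)) / (p * q)


uncorrected = chi_square_2x2(65, 25, 20, 50)
corrected = chi_square_2x2(65, 25, 20, 50, yates=True)
by_proportions = chi_square_2x2_by_proportions(65, 25, 20, 50)
```

For the data of Example 32 the first and third values agree, and the corrected value is the smaller, as the text leads us to expect.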


(Note. For a way of treating 2 × 2 tables by the method of
rank correlation see J. W. Whitfield, Biometrika, vol. XXXIV,
December 1947, pp. 295-6.)

9.vi. 2 × n tables. Cases frequently occur where we wish
to examine the association between two variables, one of
which is dichotomous whilst the other is capable of division
into several, say n, classes. In such cases $\chi^2$ may be calculated
by the method of Section 9.ii, but the calculation of the
expected frequencies may be avoided in the following manner.
Suppose the 2 × n table is represented as under, y being the
dichotomous variable and x the variable with several categories
(four in the example). The observed frequencies in the cells
are denoted in each case by the letter a with appropriate
suffixes and dashes:

              x_1          x_2          x_3          x_4
    y_1       a_1          a_2          a_3          a_4         n_1
    y_2       a'_1         a'_2         a'_3         a'_4        n_2
           a_1 + a'_1   a_2 + a'_2   a_3 + a'_3   a_4 + a'_4      N

If we take a and a' to represent any associated pair of
observed frequencies, then we calculate for each pair

$$\frac{1}{a + a'}(a n_2 - a' n_1)^2,$$

where $n_1$ and $n_2$ are the total frequencies in the $y_1$ and $y_2$ classes.
This expression is evaluated for each pair of associated
frequencies and the sum of all such expressions is divided by
($n_1 \times n_2$), giving $\chi^2$. Hence

$$\chi^2 = \frac{1}{n_1 n_2} S\left[\frac{1}{a + a'}(a n_2 - a' n_1)^2\right]. \tag{32}$$

The number of degrees of freedom in a 2 × n table is n - 1, so
that the significance of $\chi^2$ in the above case may be determined
by consulting Table VII for n = 3.
Example 33. Determine the significance of the association
between the variables x and y in the following contingency
table:

              x_1     x_2     x_3     x_4     Total
    y_1        36      42      30      12      120
    y_2        12      22      20      26       80
    Total      48      64      50      38      200

Calculating the expression $\dfrac{(a n_2 - a' n_1)^2}{a + a'}$ to the nearest
whole number for each pair of associated frequencies, we have

$$\tfrac{1}{48}(36 \times 80 - 12 \times 120)^2 = \frac{2{,}073{,}600}{48} = 43{,}200,$$

$$\tfrac{1}{64}(42 \times 80 - 22 \times 120)^2 = \frac{518{,}400}{64} = 8{,}100,$$

$$\tfrac{1}{50}(30 \times 80 - 20 \times 120)^2 = \frac{0}{50} = 0,$$

$$\tfrac{1}{38}(12 \times 80 - 26 \times 120)^2 = \frac{4{,}665{,}600}{38} = 122{,}779.$$

The sum of these quantities = 174,079. Therefore

$$\chi^2 = \frac{174{,}079}{120 \times 80} = 18.13.$$

In Table VII for P = 0.05 and n = 3 we find a value of $\chi^2$ of
7.815. Our calculated value is much greater than this and we
may therefore conclude that there is a significant association
between the two variables.

An alternative method of calculation, similar to that given
for 2 × 2 tables, may be used as a check. If p is the proportion
in the $y_1$ class in any vertical column, i.e. p = a/(a + a'), and
P is the proportion in the $y_1$ class in the whole population, so
that P = $n_1$/N and Q = 1 - P, then

$$\chi^2 = \frac{1}{PQ}\left\{S[p^2(a + a')] - n_1 P\right\} = \frac{1}{PQ}\left\{S\left[\frac{a^2}{a + a'}\right] - n_1 P\right\}.$$

To calculate this, first work out a²/(a + a') for each vertical
column and then substitute in the above formula. We obtain:

    36²/48 = 27.0
    42²/64 = 27.5625
    30²/50 = 18.0
    12²/38 =  3.7895

    S[a²/(a + a')] = 76.3520

By substitution

$$\chi^2 = \frac{200^2}{120 \times 80}\left(76.3520 - \frac{120^2}{200}\right) = 4.1667 \times 4.352 = 18.13.$$


pe 
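The check by the alternative formula can be sketched in the same way. This is an illustrative Python fragment only; the name `chi_square_check` is an assumption of ours, and the data are again those of Example 33.

```python
def chi_square_check(row1, row2):
    """Chi-square for a 2 x n table from S[a^2/(a + a')], as in the check above."""
    n1, n2 = sum(row1), sum(row2)
    N = n1 + n2
    s = sum(a * a / (a + b) for a, b in zip(row1, row2))
    return N * N / (n1 * n2) * (s - n1 * n1 / N)

check = chi_square_check([36, 42, 30, 12], [12, 22, 20, 26])
print(round(check, 2))
```

Since the two expressions are algebraically identical, any disagreement between this value and the one from formula (32) points to an arithmetical slip.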
9.vii. Goodness of fit. One other use of the $\chi^2$ method
may be given. Suppose we have an observed frequency distribution
of a variable and wish to examine the validity of
some hypothesis about that distribution. We may do this by
calculating what the distribution would be on that hypothesis
and examining the agreement, or goodness of fit, of the observed
and calculated distributions.

As an example, an observed frequency distribution of accidents
will be given and an examination made of the hypothesis
that the accidents are distributed in a Poisson series. If the
probability of an event occurring is very small but a long
enough period of observation is taken for it to happen sometimes,
so that it may happen 0, 1, 2, 3, ... times, then the distribution
of its occurrence will form what is called a Poisson
series, if a large enough number of independent observations is
made. The frequency distribution for 0, 1, 2, 3, ... occurrences
will be given by the successive terms of the series

$$N e^{-m}\left(1,\; m,\; \frac{m^2}{2!},\; \frac{m^3}{3!},\; \ldots\right),$$

where N is the total number of observations and m is the mean
number of occurrences.
Example 34. Examine the hypothesis that the following
frequency distribution of accidents forms a Poisson series:

Number of      Observed
accidents      frequency
    0             14
    1             37
    2             76
    3             70
    4             64
    5             53
    6             31
    7             19
    8             14
9 and over        20

  Total          398
In order to calculate the expected frequency on the assumption
that the distribution forms a Poisson series we need to
know m, the mean number of accidents. This is readily calculated
by the method of formula (2), Section 2.ii. We find that
S(X) = 1549, whence the mean number of accidents is
1549/398 = 3.892. Substituting this value for m and 398 for
N in the above formula for the Poisson series, we obtain the
expected frequencies as below (see later for method of calculation):

Number of      Expected
accidents      frequency
    0             8.12
    1            31.61
    2            61.51
    3            79.80
    4            77.64
    5            60.43
    6            39.20
    7            21.80
    8            10.60
    9             4.59 \
   10             1.78  |
   11             0.63  |  7.28
   12             0.20  |
   13             0.06  |
   14             0.02 /

  Total         397.99


In order to satisfy the provision that in calculating $\chi^2$ the
expected frequencies shall not be less than 5, the last six
entries are grouped together. Since the expected frequencies
have been calculated to only 2 places of decimals, the total of
the expected frequencies is 397.99 and not exactly 398.

Calling the observed and expected frequencies O and E respectively,
we now construct columns headed $(O - E)$, $(O - E)^2$
and $(O - E)^2/E$. The total of the last column is $\chi^2$. All six
columns are given below.

We have used ten groups to calculate $\chi^2$ but the number of
degrees of freedom is only 8, since both N and m were fixed in
the formation of the Poisson series. For n = 8, we find that
$\chi^2$ = 15.507 in Table VII, p. 97. The calculated value is much
greater than this, hence the Poisson series is a bad fit to the
observed accident distribution. We find, in fact, from Elderton's
table that for such a value of $\chi^2$, P is less than 0.001,
and so we may conclude that the observed distribution
certainly does not form a Poisson series.


No. of
accidents       O       E      (O - E)    (O - E)²    (O - E)²/E
0              14     8.12       5.88      34.5744       4.26
1              37    31.61       5.39      29.0521       0.92
2              76    61.51      14.49     209.9601       3.41
3              70    79.80      -9.80      96.0400       1.20
4              64    77.64     -13.64     186.0496       2.40
5              53    60.43      -7.43      55.2049       0.91
6              31    39.20      -8.20      67.2400       1.72
7              19    21.80      -2.80       7.8400       0.36
8              14    10.60       3.40      11.5600       1.09
9 and over     20     7.28      12.72     161.7984      22.23

Total                            0.01                   38.50

(Note. The total of the (O - E) column should as usual be
zero. It is not exactly so in our case since the expected frequencies
were calculated to only two places of decimals.)
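The whole test can be retraced in a few lines. What follows is a sketch under stated assumptions, not the method of the text: the expected frequencies are computed directly rather than by logarithms, and the final group pools the entire tail (9 and over) instead of stopping at 14 accidents, so the result differs slightly from the 38.50 found above.

```python
import math

# Observed frequencies for 0, 1, ..., 8 accidents; the last entry is "9 and over".
observed = [14, 37, 76, 70, 64, 53, 31, 19, 14, 20]
N, m = 398, 3.892

# Expected Poisson frequencies for 0..8, then the pooled remainder.
expected = [N * math.exp(-m) * m ** k / math.factorial(k) for k in range(9)]
expected.append(N - sum(expected))

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
dof = len(observed) - 2  # both N and m were fixed from the data

print(round(chi2, 2), dof)
```

With 8 degrees of freedom the tabled 5% value is 15.507, so the conclusion is unchanged by these small differences of rounding.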


Method of calculating the expected frequencies

The calculation of the expected frequencies in the above
example is best done by using logarithms. In the Poisson series
the term $Ne^{-m}$ is common to each successive frequency, so this
is worked out first:

log 398 = 2.59988,
log e = 0.43429,
log log e = $\bar{1}.63778$,
log 3.892 = 0.59018;

hence 3.892 log e = antilog ($\bar{1}.63778$ + 0.59018)
                  = 1.69026;

therefore log $398e^{-3.892}$ = 2.59988 - 1.69026
                              = 0.90962,

whence $398e^{-3.892}$ = 8.12.
The values of the successive terms may then be found by a
tabular method, making use of the logarithms and logarithms
of factorials given in Fisher and Yates's tables (Ref. 5, pp. 62,
76).

The method is indicated below. p is the index of the power
to which m is raised in the successive terms:

                                               (log m^p - log p!)
p    log m^p    log p!    log m^p - log p!       + log Ne^(-m)      antilog
1    0.59018    0.0           0.59018               1.49980          31.61
2    1.18036    0.30103       0.87933               1.78895          61.51
3    1.77054    0.77815       0.99239               1.90201          79.80
etc.  etc.      etc.          etc.                  etc.             etc.
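The tabular method translates directly into a short routine. This is a hedged illustration only: `poisson_term` is a name of our own choosing, and the logarithms are computed by the machine instead of being read from Fisher and Yates's tables.

```python
import math

N, m = 398, 3.892

# log of the common term N e^-m, worked out once as in the text.
log_common = math.log10(N) - m * math.log10(math.e)

def poisson_term(p):
    """Expected frequency for p occurrences: antilog of p log m - log p! + log(N e^-m)."""
    log_term = p * math.log10(m) - math.log10(math.factorial(p)) + log_common
    return 10 ** log_term

for p in range(4):
    print(p, round(poisson_term(p), 2))
```

The first four lines of output may be compared with the expected frequencies 8.12, 31.61, 61.51 and 79.80 given earlier.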

9.viii.

... the equation of the best fitting linear regression line and then
check the goodness of fit by the method of analysis of variance.
Now the algebraic equation of a straight line is y = ax + b.
This is equivalent to the linear regression of y on x. The constant
a indicates the slope of the regression, and b the
height of the line above the base. Hence a is equivalent to the
regression coefficient of Section 8.i, i.e.

$$a = r\,\frac{\sigma_y}{\sigma_x} \quad\text{and}\quad b = \bar{y} - a\bar{x}.$$

If our data consist merely of a series of N means we have insufficient
material for calculating r and $\sigma$. However, we know
from formula (14 A) that

$$r = \frac{S(XY)/N - \bar{X}\bar{Y}}{\sigma_x \sigma_y},$$

hence

$$a = r\,\frac{\sigma_y}{\sigma_x} = \frac{S(XY)/N - \bar{X}\bar{Y}}{\sigma_x^2} = \frac{S(XY) - N\bar{X}\bar{Y}}{N\sigma_x^2}. \qquad (34)$$

This expression may be calculated directly from the data,
since we know the exact values of x, the independent variate.
The method of calculation is simple and is illustrated by the
following example.
Example 35. The same test was given 10 times to a group of
subjects and the mean scores obtained were as follows:

Test          1     2     3      4      5     6      7     8     9      10
Mean score    11    14    17.5   20.5   23    25.5   28    32    35.5   37.5

A graph of these means suggests that they lie roughly on a
straight line. The line will be the regression of test score on
order of testing so that we will call test order x, the independent
variate, and mean score y, the dependent variate. We first
calculate the coefficients a and b in the equation y = ax + b and
then proceed to examine how well a line with this equation fits
the data.
Five columns of figures are constructed as in product-moment
correlation. These are headed x, y, x², y² and xy
respectively and each column is summed. In order to reduce
the arithmetic, the y entries may be taken from an arbitrary
origin, say 25 in this case. The table below shows the columns
and totals obtained.

 x       y       x²       y²        xy
 1     -14        1     196       -14
 2     -11        4     121       -22         N      = 10
 3      -7.5      9      56.25    -22.5       S(x)   = 55
 4      -4.5     16      20.25    -18         S(y)   = -5.5
 5      -2       25       4       -10         S(x²)  = 385
 6       0.5     36       0.25      3         S(y²)  = 722.25
 7       3       49       9        21         S(xy)  = 213
 8       7       64      49        56
 9      10.5     81     110.25     94.5
10      12.5    100     156.25    125

55      -5.5    385     722.25    213

Hence $\bar{x} = 5.5$,

$$N\bar{x}\bar{y} = \bar{x}\,S(y) = 5.5 \times (-5.5) = -30.25,$$
$$N\bar{x}^2 = \bar{x}\,S(x) = 5.5 \times 55 = 302.5,$$
$$a = \frac{S(xy) - N\bar{x}\bar{y}}{S(x^2) - N\bar{x}^2} = \frac{213 + 30.25}{385 - 302.5} = 2.948,$$
$$\bar{y} = -0.55,$$
$$b = \bar{y} - a\bar{x} = -0.55 - 2.948 \times 5.5 = -16.764.$$


This value of b requires correction since y was taken from an
arbitrary origin. This is done by the addition of 25, so that the
true value of b is 25 - 16.764 = 8.236. Hence the straight line
which best fits the data is given by the equation

y = 2.948x + 8.236.

By substituting values of x from 1 to 10 in this equation we
may calculate the best fitting theoretical values of y and
compare these with the observed values, as under:

 x     Observed y     Calculated y
 1        11             11.184
 2        14             14.132
 3        17.5           17.080
 4        20.5           20.028
 5        23             22.976
 6        25.5           25.924
 7        28             28.872
 8        32             31.820
 9        35.5           34.768
10        37.5           37.716
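The fitting of the line may be retraced as follows. This is a minimal sketch, assuming the scores are taken from their true origin (no arbitrary origin is needed by machine); the variable names are ours. Because a is not rounded before b is found, b comes out about 8.233 instead of 8.236.

```python
x = list(range(1, 11))
y = [11, 14, 17.5, 20.5, 23, 25.5, 28, 32, 35.5, 37.5]

N = len(x)
x_bar, y_bar = sum(x) / N, sum(y) / N
s_xy = sum(xi * yi for xi, yi in zip(x, y))
s_xx = sum(xi * xi for xi in x)

# Formula (34) with N * sigma_x^2 written as S(x^2) - N * x_bar^2.
a = (s_xy - N * x_bar * y_bar) / (s_xx - N * x_bar ** 2)
b = y_bar - a * x_bar

print(round(a, 3), round(b, 3))
```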


It is obvious by examination that the observed values lie
very closely about the calculated regression line. The significance
of the fit may be tested mathematically by an analysis
of the variance. (See Chapter XI.) For this purpose, the
variance of y is split into two parts, that due to linear regression
and that due to departures from linear regression. Since
we have only mean values of y, the variance of y can be estimated
with only N - 1 degrees of freedom, in this case 9. The
total sum of squares about the mean is given by
$S(y^2) - [S(y)]^2/N$.

The part of the variance due to linear regression is given by
multiplying the sum of squares of x about its mean by $a^2$,
i.e. it is given by

$$a^2\left[S(x^2) - \frac{[S(x)]^2}{N}\right].$$

Subtraction of this from the total sum of squares gives the
part of the variance due to departures from linear regression.
We get in this example, therefore, the following analysis
of variance:

                                                               Mean
                            Sum of squares            D.F.    squares     V.R.
1. Total                    722.25 - 3.025 = 719.225    9        -          -
2. Due to linear
   regression               82.5 x 2.948² = 716.983     1     716.983    2560.7
3. Due to departures
   from linear regression                     2.242     8       0.280

The sums of squares for lines 2 and 3 are divided by their
respective number of degrees of freedom (D.F.) to obtain the
mean squares or estimates of the variance. Finally the variance
ratio (V.R.) is obtained by dividing the estimated variance due
to linear regression by that due to departures from linear
regression, giving the ratio of 2560.7. From Fisher and Yates's
tables (Ref. 5) we find that for $n_1$ of 1 and $n_2$ of 8 the critical
value of the variance ratio at the 1% point is 11.26. The
observed value of 2560.7 is greatly in excess of this, which
means that almost the whole of the variance of y is due to
linear regression and the part due to departures from linear
regression is negligible.
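The analysis of variance for the fitted line can be sketched in Python as below. The names are our own and the fragment is illustrative only. Note that it carries the slope unrounded, so the regression sum of squares is about 717.22 and the variance ratio about 2860; the text's 716.983 and 2560.7 arise from squaring the rounded a = 2.948. The conclusion is the same either way.

```python
x = list(range(1, 11))
y = [11, 14, 17.5, 20.5, 23, 25.5, 28, 32, 35.5, 37.5]
N = len(x)

ss_x = sum(v * v for v in x) - sum(x) ** 2 / N           # S(x^2) - [S(x)]^2 / N
ss_total = sum(v * v for v in y) - sum(y) ** 2 / N       # N - 1 degrees of freedom
s_xy = sum(p * q for p, q in zip(x, y)) - sum(x) * sum(y) / N

a = s_xy / ss_x
ss_regression = a * a * ss_x                             # 1 degree of freedom
ss_departures = ss_total - ss_regression                 # N - 2 degrees of freedom

variance_ratio = ss_regression / (ss_departures / (N - 2))
print(round(ss_total, 3), round(ss_regression, 3), round(variance_ratio, 1))
```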

9.ix. Logarithmic curves. It is sometimes obvious that
a straight regression line will not be a good fit to observed data.

Example 36. A subject was given a sensori-motor test 12
consecutive times and his scores were:

2.0, 3.3, 4.0, 4.5, 4.7, 5.0, 5.5, 5.6, 5.9, 6.0, 6.1, 6.3.

If these scores are plotted against order of testing, a line is
obtained which is evidently not a straight line. The line rises
fairly steeply at first and gradually flattens out, which suggests
that some form of logarithmic curve might fit the data best.
The simplest forms of logarithmic curves are given by the
following equations:

y = a log x + b,
log y = ax + b,
log y = a log x + b.

To discover which curve gives the best fit to the data it is
necessary to plot each one of them, giving x the values of 1
to 12, and selecting the one which gives an approximately
straight line. With the present data it will be found that
plotting y against log x gives an almost straight line, hence
we assume that the observed scores will lie very closely about
a curve whose equation is y = a log x + b. We have therefore
to calculate the constants a and b and then examine the
goodness of fit by performing an analysis of variance.


As in the previous example, we first construct and sum five
columns. The first column will be values of log x, which for
convenience we may head X. The whole calculation is given
below:

         X
 x   (= log x)     y       X²        y²        Xy
 1     0.00       2.0    0.0000     4.00     0.000
 2     0.30       3.3    0.0900    10.89     0.990
 3     0.48       4.0    0.2304    16.00     1.920      N     = 12
 4     0.60       4.5    0.3600    20.25     2.700      S(X)  = 8.68
 5     0.70       4.7    0.4900    22.09     3.290      S(y)  = 58.9
 6     0.78       5.0    0.6084    25.00     3.900      S(X²) = 7.4618
 7     0.85       5.5    0.7225    30.25     4.675      S(y²) = 307.55
 8     0.90       5.6    0.8100    31.36     5.040      S(Xy) = 47.268
 9     0.95       5.9    0.9025    34.81     5.605
10     1.00       6.0    1.0000    36.00     6.000
11     1.04       6.1    1.0816    37.21     6.344
12     1.08       6.3    1.1664    39.69     6.804

       8.68      58.9    7.4618   307.55    47.268
$$\bar{X} = 8.68/12 = 0.7233,$$
$$N\bar{X}\bar{y} = 0.7233 \times 58.9 = 42.6024,$$
$$N\bar{X}^2 = 0.7233 \times 8.68 = 6.2782,$$
$$a = \frac{47.268 - 42.6024}{7.4618 - 6.2782} = 3.942,$$
$$\bar{y} = 58.9/12 = 4.9083,$$
$$b = 4.9083 - 3.942 \times 0.7233 = 2.0571.$$

It appears, therefore, that the original data are best fitted
by a curve which has the equation y = 3.942 log x + 2.0571.
Substituting values of x from 1 to 12 in this equation we obtain
the following comparison of observed and calculated values
of y:

 x     Observed y     Calculated y
 1        2.0            2.057
 2        3.3            3.244
 3        4.0            3.938
 4        4.5            4.431
 5        4.7            4.813
 6        5.0            5.125
 7        5.5            5.388
 8        5.6            5.617
 9        5.9            5.819
10        6.0            6.000
11        6.1            6.162
12        6.3            6.311

The fit appears to be very good. Examining this mathematically,
as in the previous example, we obtain the following analysis
of variance:

                                                             Mean
                            Sum of squares          D.F.    squares    V.R.
1. Total                    307.55 - 289.10 = 18.45   11       -         -
2. Due to linear regression 1.184 x 3.942² = 18.40     1     18.40     3680
3. Due to departures from
   linear regression                          0.05    10      0.005

This is a highly significant fit since the critical value of the
variance ratio for $n_1$ = 1 and $n_2$ = 10 is 10.04 at the 1% point.

9.x. Polynomials. When a logarithmic curve which
yields a straight regression line cannot be found, it may be
necessary to fit a polynomial curve to observed data. Examples
of such curves are:

quadratic   y = ax² + bx + c,
cubic       y = ax³ + bx² + cx + d,
quartic     y = ax⁴ + bx³ + cx² + dx + e, etc.

There are various ways of fitting such curves (for example
the method of least squares), but they all involve
algebraical computations and will not be described in this
book. (For a good account of fitting polynomials see Goulden,
Methods of Statistical Analysis, Chapter XIV. J. Wiley and
Sons, New York, 1947.)


EXERCISES ON CHAPTER IX

30. (a) Construct contingency tables between ... Experiment I in
Appendix ... as nearly equal in size as possible according to
their scores. Suitable points of division for ...

A  (1)  39 and under:  (2)  40-49:    (3)  50 and over.
B  (1) 164 and under:  (2) 165-194:   (3) 195 and over.
C  (1)  19 and under:  (2)  20-29:    (3)  30 and over.
D  (1)  24 and under:  (2)  25-35:    (3)  36 and over.
E  (1) 115 and under:  (2) 116-136:   (3) 137 and over.
F  (1)  24 and under:  (2)  25-30:    (3)  31 and over.
G  (1)  33 and under:  (2)  34-40:    (3)  41 and over.

(b) Calculate $\chi^2$ and the coefficient of contingency for each table
and determine in which tables there is a significant relationship.


ate Д2, using 
nt relationship 


33. In a certain experiment readings had to be made from a scale
which was graduated in tenths of an inch. A particular observer took
1000 readings and an analysis was made of the frequency of occurrence
of the final digits of the observations, those showing the tenth of an
inch for which each reading was taken. The frequency distribution of
these digits was as follows:

Digit      Frequency
  0           151
  1            79
  2            95
  3           109
  4            50
  5           185
  6            67
  7            98
  8           110
  9            56

Total        1000

There was no reason in the experiment for any particular final
digits to occur more frequently than any others. Could this observer
be regarded as reliable in reading the scale?

(The hypothesis to be examined here is that the observed frequency
of the digits does not depart significantly from that which would be
expected if the observer were completely reliable. On this hypothesis
each digit is equally likely to occur, so that the expected frequency is
the same for each, namely, 100. Calculate $\chi^2$ and test its significance
for 9 degrees of freedom.)

34. What is the equation of the straight line which best fits the
following serial readings?

No.        1    2    3      4      5     6     7     8
Reading    5    8    10½    13½    17    21    24    26

Examine the goodness of fit by the method of analysis of variance.


Chapter X 
RANKING AND THE AGREEMENT OF JUDGES 


10.i. Paired comparisons: the coefficient of consistence.
Frequently in experimental work the investigator
wishes to rank a number of objects in order according to some
quality which cannot be directly measured. He might, for
instance, wish to obtain the order of preference of a number of
pictures as judged by certain observers. Now it is often a
matter of great difficulty for a judge to rank a whole group of
objects in order of preference, but usually he can state his
preference for one of a pair with ease.

The method of paired comparisons consists in presenting the
judge serially with all possible pairs of the objects
to be judged, so that each object is compared with each other
object separately. If there are n objects, then the number of
possible pairs is $\binom{n}{2}$, which is equal to $\tfrac{1}{2}n(n - 1)$. It may be
seen that this number of pairs increases rapidly with n,
so that there is a practical limit to the number of objects
which can be judged in this way.

 n   C(n,2)      n   C(n,2)      n   C(n,2)      n   C(n,2)
 2      1        6     15       10     45       14     91
 3      3        7     21       11     55       15    105
 4      6        8     28       12     66       16    120
 5     10        9     36       13     78       17    136


From the results of the comparisons of the pairs of objects
a ranked order of preference of all the n objects may be
worked out, but before doing this an estimate of the consistency
of judgment exercised may be made. Suppose, for
example, that there are 8 objects which may be designated
A, B, C, D, E, F, G and H. Now if the judge prefers A to B and
B to C, then if he is consistent in his judgment he will
prefer A to C when that pair is presented to him. If, however,
he is inconsistent he might, for example, prefer A to B, B to C
and C to A, which may be symbolised as A > B > C > A. This
is known as a circular triad, and the number of such triads a
judge gives in the whole series of paired comparisons provides
an obvious measure of his degree of consistency. (It is possible
for a judge to give circular polyads, i.e. four or more objects
concerned, but such polyads may always be broken up into
triads.)

If n is the number of objects to be judged and d is the
number of circular triads shown in the whole series of comparisons,
then the coefficient of consistence, $\zeta$ (zeta), may be
calculated from one of the following expressions:

$$\text{if } n \text{ is odd,} \quad \zeta = 1 - \frac{24d}{n^3 - n}, \qquad (35)$$

$$\text{if } n \text{ is even,} \quad \zeta = 1 - \frac{24d}{n^3 - 4n}. \qquad (35\text{A})$$

The next point is the method of determining d. This may be
done by inspection of the series, but the method is laborious
and liable to error with a long series of comparisons. The best
method of finding d is as follows:

Example 37. First construct a paired comparisons table as
in Table VIII. This is a square table with rows and columns
headed A, B, C, ..., n and has a diagonal line drawn from the
cell AA to the cell nn. In the table below n = 8.

With 8 objects, 28 comparisons have to be made, and the
result of each comparison is entered on the table in the form of
an X. An X in any cell indicates that the object at the head of
the column is preferred to the object at the head of the row
containing that cell. The number of X's in each column is
summed and the totals entered at the foot of the table: the
sum of these equals $\tfrac{1}{2}n(n - 1)$, i.e. with 8 objects the sum of
the totals is 28.

TABLE VIII. Paired comparisons table

         A    B    C    D    E    F    G    H
A        \
B        X    \
C        X    X    \    X         X
D        X    X         \
E        X    X    X    X    \
F        X    X         X    X    \    X
G        X    X    X    X    X         \
H        X    X    X    X    X    X    X    \

Total    7    6    3    5    3    2    2    0      28

Now if there were no inconsistencies at all, the numbers at
the foot of the table would all be different and would run from
0 to 7 and would give the ranked order of preference. This is
not the case in Table VIII for there are two 3's and two 2's
and 4 and 1 are missing. It follows, therefore, that there were
some inconsistencies of judgment. The exact number of
triads may be found by calculating

$$d = \tfrac{1}{12}n(n - 1)(2n - 1) - \tfrac{1}{2}\Sigma(T^2),$$

where T is any column total.

In the example, with n = 8,

$$d = \frac{8 \times 7 \times 15}{12} - \tfrac{1}{2}(7^2 + 6^2 + 3^2 + 5^2 + 3^2 + 2^2 + 2^2 + 0^2) = 70 - 68 = 2.$$

Inspection of Table VIII shows that these two triads are
F > C > E > F and F > C > G > F. For this particular judge,
d = 2 and n = 8, so that his coefficient of consistence as given
by formula (35A) is

$$\zeta = 1 - \frac{24 \times 2}{512 - 32} = 0.90.$$

We may regard him, therefore, as having a high degree of
consistency and his preference ranking of the 8 objects may
be written as ABDCEFGH, C and E being assessed as equal
and also F and G.
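The count of circular triads and the coefficient of consistence follow directly from the column totals. The fragment below is a sketch only; `circular_triads` and `consistence` are names chosen here for illustration, and the totals are those at the foot of Table VIII.

```python
def circular_triads(column_totals):
    """d = n(n-1)(2n-1)/12 - (1/2) * sum of squared column totals."""
    n = len(column_totals)
    return n * (n - 1) * (2 * n - 1) / 12 - sum(t * t for t in column_totals) / 2

def consistence(n, d):
    """Coefficient of consistence: formula (35) for odd n, (35A) for even n."""
    denominator = n ** 3 - n if n % 2 else n ** 3 - 4 * n
    return 1 - 24 * d / denominator

totals = [7, 6, 3, 5, 3, 2, 2, 0]
d = circular_triads(totals)
zeta = consistence(len(totals), d)
print(d, round(zeta, 2))
```

A perfectly consistent judge gives totals running from 0 to n - 1, for which the routine returns d = 0.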
10.ii. Significance of $\zeta$. The following table, calculated
by J. W. Whitfield, gives the maximum values of d for significance
for values of n from 6 to 20:

 n     d        n     d        n     d
 6     0       11    30       16   121
 7     3       12    42       17   149
 8     8       13    57       18   181
 9    13       14    75       19   217
10    20       15    96       20   258

For $\zeta$ to be significant, d should be not greater than the value
given in this table against the appropriate value of n. For
values of n below 6, significance cannot be proved.
10.iii. Tied preferences. It often happens in practice
that one or more pairs in a paired comparisons experiment are
judged equal. There are various reasons why this may be so.
A pair of objects may actually be equal in the quality being
judged, or the judge may think them equal, or the judge may
be incapable of making the discriminatory judgment. In the
last instance he would probably show many inconsistencies
in any case. Such ties are a trouble theoretically but from the
experimental point of view they must be allowed.

Example 38. Table IX shows the method of dealing with
such ties when they occur. This is an example where n = 7.

TABLE IX

Here A and C had equal preference, so the figure ½ is written
in both the AC and the CA cells. The same is done for the other
two equal pairs, AF and DG. From the columns' totals the
ranked order is CBADGEF. In this case d is 5, as calculated
by the formula in the preceding section. Hence by formula (35)

$$\zeta = 1 - \frac{24 \times 5}{343 - 7} = 0.64.$$

Since d is greater than 3, this value is not significant, so that
the particular judge concerned would be regarded as inconsistent.

10.iv. The case of m judges: coefficient of agreement.
The application of the method of paired comparisons by a
single judge is chiefly of use when we know on other grounds
the true ranked order of the objects judged. For example, a
person might profess to be able to assess intelligence from
photographs. We could examine his ability to do so by photographing
a number of people whose intelligence (as measured
by intelligence tests) we knew and then asking him to
rank the photographs for intelligence by the paired comparisons
method. The order he gave could then be compared
with the previously known order by a rank correlation
method. (See Chapter VII.)

In practice it often occurs that we cannot measure directly
the quality which we want to rank and have to rely on
personal judgments for this ranking. We might, for instance,
wish to rank a number of people for the quality of initiative.
In such a case as this we should be unwilling to rely on the
preference of a single judge. We should get as many
judges as possible and explain to them carefully what we meant
by initiative before getting them each to rank the people
separately by the paired comparisons method.
Suppose we had n people to be assessed and m judges. We
should first examine the consistency of each judge by the
method described previously. This would involve making m
tables of paired comparisons and calculating the coefficient of
consistence for each. In practice we should then probably
omit any judge or judges who were very inconsistent since
their judgments would not be reliable. Let us assume here
that all m judges showed a high degree of consistency. The
next point to be decided is whether or not the judges agree
with one another. This may be examined by calculating the
coefficient of agreement, u.

To do this we first construct a summed paired comparisons
table. This is done by drawing up an n x n table and entering
in each cell the sum of the entries in the m separate tables.
For example, count the number of crosses and ½'s in all the
m AB cells in the separate tables and write the total in the AB
cell of the summed table, and so on. The columns are summed
as usual and the total of these sums is $\tfrac{1}{2}mn(n - 1)$.

TABLE X. Summed paired comparisons table: n = 6, m = 5

75 = ½ mn(n - 1)

92 = Σ
Example 39. Table X shows a summed paired comparisons
table for 6 objects judged by 5 observers, so that n = 6 and
m = 5. We shall deal first with the case where there are no ½'s
in the summed table in order to illustrate two methods of
calculation. The number written in the top right-hand corner
of each cell is the total of the entries in the 5 corresponding
cells in the separate tables (which are not reproduced). Let
this entry for any cell be symbolised by $\gamma$ (gamma). Then for
each cell we calculate $\tfrac{1}{2}\gamma(\gamma - 1)$ and write the answer in the
bottom left-hand corner of each cell. If we add all these values
together we obtain $\Sigma\binom{\gamma}{2}$, which we may call $\Sigma$. Then u, the
coefficient of agreement, is given by the formula

$$u = \frac{2\Sigma}{\binom{m}{2}\binom{n}{2}} - 1 = \frac{8\Sigma}{mn(m - 1)(n - 1)} - 1. \qquad (36)^*$$

From the data in Table X we obtain $\Sigma$ = 92. Hence in this
example

$$u = \frac{8 \times 92}{5 \times 6 \times 4 \times 5} - 1 = 0.227.$$

There is an alternative method of calculating $\Sigma$ which may
be used. Taking the half of the table below the diagonal, first
sum the entries to obtain $\Sigma(\gamma)$. Then square each $\gamma$ and sum
to obtain $\Sigma(\gamma^2)$. $\Sigma$ is then given by the formula

$$\Sigma = \Sigma(\gamma^2) - m\,\Sigma(\gamma) + \binom{m}{2}\binom{n}{2}. \qquad (37)$$

In the example in Table X, we find $\Sigma(\gamma)$ = 43 and $\Sigma(\gamma^2)$ = 157.
Hence, from formula (37),

$$\Sigma = 157 - 5 \times 43 + 10 \times 15 = 92.$$

Note that the same result may be obtained as a check by
confining ourselves to the portion of the table above the
diagonal. In this case we find $\Sigma(\gamma)$ = 32 and $\Sigma(\gamma^2)$ = 102.
Hence

$$\Sigma = 102 - 5 \times 32 + 10 \times 15 = 92.$$

In practice, apart from checking, one would work with the
half of the table where the numbers were smaller.

* See Appendix H for a table to help in the calculation.
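Both routes to the coefficient of agreement can be set out in a few lines. This is a hedged sketch: the function names are ours, and because Table X itself is not reproduced here the routine for formula (37) takes the two sums $\Sigma(\gamma)$ and $\Sigma(\gamma^2)$ as given in the text instead of the individual cell entries.

```python
from math import comb

def agreement(sigma, m, n):
    """Coefficient of agreement u from formula (36)."""
    return 2 * sigma / (comb(m, 2) * comb(n, 2)) - 1

def sigma_from_sums(sum_gamma, sum_gamma_sq, m, n):
    """Sigma from formula (37), using the sums over one half of the summed table."""
    return sum_gamma_sq - m * sum_gamma + comb(m, 2) * comb(n, 2)

m, n = 5, 6
below = sigma_from_sums(43, 157, m, n)   # half below the diagonal
above = sigma_from_sums(32, 102, m, n)   # half above the diagonal, as a check
u = agreement(below, m, n)
print(below, above, round(u, 3))
```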
10.v. Significance of u. The significance of u may be
tested by calculating $\chi^2$. Making a correction for continuity
(similar to Yates's correction), we have in this case the following
expression for $\chi^2$:

$$\chi^2 = \left[\Sigma - 1 - \tfrac{1}{2}\binom{n}{2}\binom{m}{2}\frac{m - 3}{m - 2}\right]\frac{4}{m - 2}. \qquad (38)^*$$

The number of degrees of freedom, $\nu$ (nu), corresponding to
this value of $\chi^2$ is given by

$$\nu = \binom{n}{2}\frac{m(m - 1)}{(m - 2)^2}. \qquad (39)^*$$

The significance of $\chi^2$ is then examined by calculating
$\sqrt{2\chi^2} - \sqrt{2\nu - 1}$. If this expression is greater than 1.65, $\chi^2$
and consequently u is significant.

In the example in 10.iv we have n = 6, m = 5 and $\Sigma$ = 92.

From formula (38)

$$\chi^2 = \left(92 - 1 - \tfrac{1}{2}(15 \times 10)\tfrac{2}{3}\right)\tfrac{4}{3} = 54.7.$$

From formula (39)

$$\nu = 15 \times \frac{5 \times 4}{3 \times 3} = 33.3.$$

Hence

$$\sqrt{2\chi^2} - \sqrt{2\nu - 1} = \sqrt{109.3} - \sqrt{65.6} = 10.46 - 8.10 = 2.36.$$

This is greater than 1.65, therefore u is significant.
From the table of the normal curve (see Ref. 1, Appendix,
Table 2), it may be seen that the probability of getting a value
of $\chi^2$ as large or larger than that obtained is P = 0.009.

(Note. For exact probabilities for values of m from 3 to 6
inclusive and small values of n, see Kendall, vol. I, Tables
16.11-16.14, Ref. 6.)
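The test of significance for u can be sketched as follows. The function name `agreement_test` is an assumption made for illustration, and the probability is obtained from the complementary error function in place of the printed table of the normal curve. Carrying the figures unrounded gives a deviate of about 2.35 instead of the 2.36 found above.

```python
from math import comb, erfc, sqrt

def agreement_test(sigma, m, n):
    """Chi-square with continuity correction, its degrees of freedom,
    the normal deviate sqrt(2 chi^2) - sqrt(2 nu - 1), and the upper-tail P."""
    pairs = comb(n, 2)
    chi2 = 4 / (m - 2) * (sigma - 1 - 0.5 * pairs * comb(m, 2) * (m - 3) / (m - 2))
    nu = pairs * m * (m - 1) / (m - 2) ** 2
    deviate = sqrt(2 * chi2) - sqrt(2 * nu - 1)
    p_value = 0.5 * erfc(deviate / sqrt(2))
    return chi2, nu, deviate, p_value

chi2, nu, deviate, p_value = agreement_test(92, m=5, n=6)
print(round(chi2, 1), round(nu, 1), round(deviate, 2), round(p_value, 3))
```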

* See Appendix I for values of $\nu$ and Appendix J for a table to help
in the calculation of $\chi^2$.

10.vi. Combination of several rankings. Suppose we
have obtained rankings of n objects by m judges, either
directly or by the method of paired comparisons, and we wish
to combine these separate rankings into a single ranking which
may be taken as the best representation of the consensus of
judgment. The simplest and best way of doing this is to sum
the m rankings for each object and re-rank the n totals thus
obtained.

Example 40. Suppose 3 judges ranked 8 objects as follows:

Object            A    B    C    D    E    F    G    H
Judge 1           ...
Judge 2           ...
Judge 3           ...
Sum              18    8    4   19   15   11    9   24
Combined rank     6    2    1    7    5    4    3    8

The last row is the combined rank order obtained by ranking
the sums in the row above.

Sometimes the sums of rankings for two objects may be
equal, for example 1 + 1 + 3 = 5 and 2 + 2 + 1 = 5. These two
objects would be tied in the combined ranking, but if for a
particular purpose we wished to avoid ties, we should take
... rankings in the separate judgments.
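The re-ranking of the totals is easily mechanised. The sketch below is illustrative only: the sums are those quoted for Example 40 in Section 10.vii, the function name is ours, and tied sums are given the mean of the ranks they occupy, which is an assumption on our part consistent with the half-ranks used elsewhere in this chapter.

```python
def rank_with_ties(values):
    """Rank from smallest to largest, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

sums = [18, 8, 4, 19, 15, 11, 9, 24]      # objects A to H
combined = rank_with_ties(sums)
print(combined)
```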


10.vii. Coefficient of Concordance. The degree of resemblance
between m rankings of the same n objects may
be estimated by calculating the coefficient of concordance, W.
To do this, first sum the m ranks for each object as in 10.vi.
The total of these sums is $\tfrac{1}{2}mn(n + 1)$ and their mean is
$\tfrac{1}{2}m(n + 1)$. Take the deviation of each sum from this mean
and square it. Add the squares to obtain a total sum of squares
which we will call S. S is then the variance of the sums of
rankings. Now if the concordance amongst the rankings were
perfect, these sums would be m, 2m, 3m, ..., nm, and their
variance would be $\tfrac{1}{12}m^2(n^3 - n)$. The coefficient W is
given by the ratio of the observed variance S to the variance
for perfect concordance, i.e.

$$W = \frac{12S}{m^2(n^3 - n)}. \qquad (40)^*$$

In Example 40, m = 3 and n = 8. The sums of ranks were
18, 8, 4, 19, 15, 11, 9 and 24. These total to 108 with a mean of
13½. Then S is equal to

$$(18 - 13\tfrac{1}{2})^2 + (8 - 13\tfrac{1}{2})^2 + \ldots + (24 - 13\tfrac{1}{2})^2.$$

The total of this is 310. Hence, from formula (40),

$$W = \frac{12 \times 310}{9(512 - 8)} = 0.82.$$

W varies between 0 and 1 and is unity only when all the
rankings are identical.
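Formula (40) may be checked as follows. This is a minimal sketch with a function name of our own choosing; it takes the sums of ranks of Example 40 and the number of judges, and applies only when there are no tied ranks.

```python
def concordance(rank_sums, m):
    """Coefficient of concordance W by formula (40), for untied rankings."""
    n = len(rank_sums)
    mean = m * (n + 1) / 2
    S = sum((r - mean) ** 2 for r in rank_sums)
    return 12 * S / (m ** 2 * (n ** 3 - n))

W = concordance([18, 8, 4, 19, 15, 11, 9, 24], m=3)
print(round(W, 2))
```

Identical rankings give sums m, 2m, ..., nm, for which the routine returns exactly 1.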
W is related to the average Spearman $\rho$ between all possible
pairs of rankings, the relationship being given by

$$\rho_{av} = \frac{mW - 1}{m - 1}.$$

Formula (40) holds only when there are no tied ranks. When
ties are present it needs modification. Consider any one row
of rankings. If there are t ranks tied in any one set, calculate
$\tfrac{1}{12}(t^3 - t)$. Do this for each set of ties and add for the whole row,
obtaining $\Sigma\tfrac{1}{12}(t^3 - t)$. Call this total $T_1$ for the first row, $T_2$ for
the second and so on. Then if there are m rows, add the m
values of T to obtain $\Sigma(T)$. Then the modified form of formula
(40) becomes

$$W = \frac{S}{\tfrac{1}{12}m^2(n^3 - n) - m\,\Sigma(T)}. \qquad (40\text{A})$$

The following small table is of help in calculating the T's:

 t     (t³ - t)/12
 2        ½
 3        2
 4        5
 5       10
 6       17½

* See Appendix K for a table of $m^2(n^3 - n)/12$.
Example 41. Calculate the value of W for the following
data, showing 3 rankings of the same 10 objects:

Object       A    B     C     D     E     F     G     H     J     K
Ranking 1    1    2     3     4½    4½    6     7½    7½    9     10
Ranking 2    1    2½    2½    4½    4½    6½    6½    8     9½    9½
Ranking 3    1    2     4½    4½    4½    4½    8     8     8     10
Sums         3    6½    10    13½   13½   17    22    23½   26½   29½

The sums total 165 with a mean of 16½. Subtracting 16½ from
each sum, squaring and adding we get S = 691.

Now in the first row there are 2 ties with 2 members each,
so that $T_1 = \tfrac{1}{2} + \tfrac{1}{2} = 1$. Similarly, $T_2 = 2$ and $T_3 = 7$. Therefore
$\Sigma(T) = 10$. Then, from formula (40A),

$$W = \frac{691}{742.5 - 30} = 0.970.$$
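The correction for ties can be sketched as below. The names are ours and the fragment is illustrative: `tie_term` finds T for any one row of rankings, and the two rows passed to it in the usage lines are hypothetical rows chosen to contain, respectively, two tied pairs and a set of four with a set of three. The final line uses the values S = 691 and $\Sigma(T)$ = 10 stated in the example.

```python
from collections import Counter

def tie_term(row):
    """T for one ranking: sum of (t^3 - t)/12 over each set of t tied ranks."""
    return sum((t ** 3 - t) / 12 for t in Counter(row).values() if t > 1)

def concordance_with_ties(S, m, n, T_total):
    """Coefficient of concordance W by formula (40A)."""
    return S / (m ** 2 * (n ** 3 - n) / 12 - m * T_total)

two_pairs = tie_term([1, 2, 3, 4.5, 4.5, 6, 7.5, 7.5, 9, 10])
four_and_three = tie_term([1, 2, 4.5, 4.5, 4.5, 4.5, 8, 8, 8, 10])
W = concordance_with_ties(S=691, m=3, n=10, T_total=1 + 2 + 7)
print(two_pairs, four_and_three, round(W, 3))
```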
10.viii. Significance of W. The significance of W may
be tested as for Fisher's z (Section 8.1) by writing

$$z = \tfrac{1}{2}\log_e\frac{(m - 1)W}{1 - W}. \qquad (41)$$

The appropriate degrees of freedom for consulting the z table
(Fisher and Yates, Ref. 5, Table V) are

$$\nu_1 = (n - 1) - \frac{2}{m},$$
$$\nu_2 = (m - 1)\nu_1.$$

When ties are present, the same test may be used provided
$\Sigma(T)$ is small compared with $\tfrac{1}{12}m^2(n^3 - n)$. If a large part of
$\Sigma(T)$ comes from one judge, we might
regard him as being relatively incapable of making the judgments
and so omit his rankings. Great care has to be exercised,
however, since a judge may agree with the others and
yet be wrong. Another may differ from the others and yet be
right.

An alternative form of formula (41) using logs to the base
10 instead of natural logarithms is

$$z = 1.1513\left[\log\,(m - 1)W - \log\,(1 - W)\right]. \qquad (41\text{A})$$
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Example 42. Examine the significance of the value of W 
obtained in Example 40. Here m = 3, n = 8, W = 0°82. 
Hence „ _ 1-1513[log (2 x 0-82) —log (1 — 0-82)] 
1-1513(0-2148 — 1.2553) 
= 1-1513 x 0:9595 
= 1:1047. 
v, = 61 and и, = 122. From Table V in Fisher and Yates 
(Ref. 5) we find for v, = 6 and v, = 13 a value of z of 0:5350 
for the 5 % point of significance. The calculated value is much 


greater than this and is therefore significant. 
Using formula (41) we get 


I z = flog, 9-11 = 1-1047. 
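As a check on Example 42, the sketch below evaluates z both from natural logarithms and from common logarithms, together with the two degrees of freedom. The function name is illustrative and not taken from the text.

```python
import math


def w_significance(w, m, n):
    """z and its degrees of freedom for a coefficient of concordance W."""
    z = 0.5 * math.log((m - 1) * w / (1 - w))   # formula (41), natural logs
    nu1 = (n - 1) - 2 / m
    nu2 = (m - 1) * nu1
    return z, nu1, nu2


z, nu1, nu2 = w_significance(0.82, m=3, n=8)

# The same quantity from logs to base 10, as in formula (41A).
z_base10 = 1.1513 * (math.log10(2 * 0.82) - math.log10(1 - 0.82))

print(f"z = {z:.4f} (base-10 form {z_base10:.4f}), "
      f"nu1 = {nu1:.3f}, nu2 = {nu2:.3f}")
```

The constant 1.1513 is half of log_e 10, which is why the two forms agree.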
(Note. The methods described in this chapter are mainly
due to M. G. Kendall; for their derivation, etc., see his Rank
Correlation Methods, Griffin and Co., 1948.)


EXERCISES ON CHAPTER X 


35. In a paired comparisons experiment, a subject was required to
rank 7 photographs, labelled A to G, in order of the intelligence of
expression portrayed. All possible pairs were presented in random
order and the 21 judgments made were as follows (recorded here in
regular order for convenience):

    A>B    B>D    C<G
    A>C    B>E    D>E
    A>D    B<F    D>F
    A<E    B<G    D<G
    A>F    C-D    E>F
    A<G    C-E    E-G
    B>C    C>F    F<G

Construct a paired comparisons table from these data.
(a) Calculate d, the number of circular triads, and find out from the
table which are the triads.
(b) Calculate C, the coefficient of consistence.
(c) What is the ranked order of the 7 photographs?
(d) What conclusions would you draw from these findings?

36. The experiment in the previous exercise was made on 6 subjects
and the results are shown in the summed paired comparisons table
below:

             A     B     C     D     E     F     G
    A        –     0     0     0     3     2     6
    B        6     –     2     2     1     4     6
    C        6     4     –     1     1     3     6
    D        6     4     5     –     1     0     6
    E        3     5     5     5     –     0     6
    F        4     2     3     6     6     –     6
    G        0     0     0     0     0     0     –
    Totals  25    15    15    14    12     9    36    126

(a) Calculate u, the coefficient of agreement, and examine its
significance.

(b) What conclusions do you now draw about the photographs?
37. Ten men in a factory were ranked in order of their degree of
possession of initiative by each of 4 foremen. These were the rankings:

Workman    1  2  3  4  5  6  7  8  9  10
Ranking 1  4  3  9 10  1  8  2  6  5   7
Ranking 2 4 1 10 9 2 5 3 8 6 1 
Ranking3 7 4 8} ЗЕТ i 6 3 10 
Ranking4 5 2 8 9 2 7 


" 20 10:74 396 
Calculate W, the coefficient of concordance between the rankings and 
examine its significance. 


Chapter XI 
INTRODUCTION TO ANALYSIS OF VARIANCE 


11.i. Fundamental nature. The technique of analysis
of variance (which is really a misnomer, as may be seen later)
was developed by R. A. Fisher. It has since been so widely
used and elaborated that the literature on the subject is
enormous and discouraging to the elementary student. The
object of this chapter is to explain the fundamental nature
of the method and to give some simple illustrations of its
use.

Essentially, analysis of variance provides a test of the
homogeneity of a set of data. By homogeneity we mean that
all the observations could have been drawn from the same
population, which we may call the parent population, this
population being normally distributed and having a certain
mean and variance. In any experiment there may be several
factors at work, each of which may cause a certain amount of
variability in the observations made. Thus if several varieties
of potatoes are sown in different types of soil and treated with
different sorts of manure, the variation in the various yields
of tubers may be affected by variety of seed, by variety of
soil or by variety of manure. Analysis of variance enables us
to discover how much of the total variability is due to each
factor and a comparison of these contributory amounts of
variation provides a test of the homogeneity of the observa-
tions. If the data are shown not to be homogeneous, then the
factor or factors causing the heterogeneity may be isolated
and the relative amounts of their effects discovered. It should
be emphasised at the outset that for this analysis to be pos-
sible, it is essential that the experiment be very carefully
planned so as to justify the assumptions on which the method
of variate analysis is based. These assumptions will be pointed
out later.

The method derives from the fact that variance is an
additive quantity. Suppose X and Y are two independent,
normally distributed variables. (By independent is meant
uncorrelated.) Then the variance of X is $\frac{1}{N}S(X-\bar{X})^2$ and
that of Y is $\frac{1}{N}S(Y-\bar{Y})^2$, where N is the number in a sample of
each variable. In a similar way, the variance of (X + Y) will be

$$\frac{1}{N}S\{(X+Y)-(\bar{X}+\bar{Y})\}^2 = \frac{1}{N}S(X+Y-\bar{X}-\bar{Y})^2.$$

Rearranging and expanding we have

$$\frac{1}{N}S\{(X-\bar{X})+(Y-\bar{Y})\}^2
 = \frac{1}{N}S\{(X-\bar{X})^2+(Y-\bar{Y})^2+2(X-\bar{X})(Y-\bar{Y})\}$$

$$ = \frac{1}{N}S(X-\bar{X})^2+\frac{1}{N}S(Y-\bar{Y})^2+\frac{2}{N}S(X-\bar{X})(Y-\bar{Y}).$$

The last term on the right is twice the co-variance of X and Y
and since the two variables are uncorrelated this must be
equal to zero. Hence the variance of

    (X + Y) = (variance of X) + (variance of Y).

The same sort of reasoning may be applied to three or more
variables.
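The additive property can be verified numerically. The sketch below uses two short made-up series (an assumption for illustration, not data from the text), chosen so that their co-variance is exactly zero; the function names are likewise illustrative.

```python
def variance(values):
    """Variance about the mean, with divisor N as in the derivation above."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def covariance(u, v):
    """Co-variance of two series of equal length, with divisor N."""
    mean_u, mean_v = sum(u) / len(u), sum(v) / len(v)
    return sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v)) / len(u)


# Hypothetical, deliberately uncorrelated series.
x = [1, 2, 3, 4]
y = [1, -1, -1, 1]
total = [a + b for a, b in zip(x, y)]

print(variance(x), variance(y), covariance(x, y), variance(total))
```

For correlated series the general identity still holds, with the extra term: variance of (X + Y) = variance of X + variance of Y + twice the co-variance.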


If therefore we have a set of observations each of which may
be regarded as falling into several independent classifications,
the total variance may be analysed into several variances, one
for each of the separate classifications. Further, if the classes
are really independent and the data are homogeneous, then
the variance in each classification is an independent estimate
of the parent variance, and these separate estimates will all
be equal within the limits of sampling error. If they are not
equal, then the data are not homogeneous and we come to the
conclusion that the data in different classes are drawn from
different populations having different means. The important
point to remember in applying the test for homogeneity (how
this is done is explained below) is that three assumptions are
made:
(1) that the data are normally distributed;
(2) that the separate estimates of variance are independent;
(3) that the variance in each class is the same.


11.ii. One-way classification. In Section 5.iii we ex-
amined the probability that two sample means came from
the same population, using the t test. If we had three or more
samples it would be possible to apply the t test to each pair
of means, but the homogeneity of the samples can be tested
more easily by the analysis of variance.


Example 43. Numbers of leaves were taken from each of
half a dozen trees and their lengths measured. The following
are the measurements in millimetres:

    Tree   Length
    1      82, 87, 86, 90, 81, 84
    2      85, 84, 91, 92, 88
    3      92, 90, 84, 86, 88, 93, 89, 90
    4      80, 86, 87, 81, 82, 82
    5      87, 86, 88, 90, 85, 86, 87
    6      90, 86, 84, 85, 85, 86, 87, 84, 87


What we wish to know is: can all these leaves be regarded as
having come from the same species of tree?
Now if X represents the length of any leaf and we lump all
the samples together to form one group of number N, then
the variance of the whole group is given as usual by

$$\text{variance} = \frac{S(X-\bar{X})^2}{N-1},$$

since we are estimating the variance of the parent population.
The expression $S(X-\bar{X})^2$ is the sum of the squares of
deviations from the mean and will be referred to for con-
venience as s.o.s. This sum of squares may be analysed into
two parts, each one of which provides an independent estimate
of the parent variance.
Let $\bar{X}$ be the general mean of the whole group and $\bar{X}_s$ a
sample mean. Then the total s.o.s. may be written

$$S(X-\bar{X})^2 = S(X-\bar{X}_s+\bar{X}_s-\bar{X})^2.$$

Grouping and expanding the right-hand side we get

$$S\{(X-\bar{X}_s)+(\bar{X}_s-\bar{X})\}^2
 = S(X-\bar{X}_s)^2+S(\bar{X}_s-\bar{X})^2+2S(X-\bar{X}_s)(\bar{X}_s-\bar{X}).$$

The product term on the right equals zero, hence we may write

$$S(X-\bar{X})^2 = S(X-\bar{X}_s)^2+S(\bar{X}_s-\bar{X})^2.$$

The first part on the right is the s.o.s. of the observations in
each sample about their respective means: the second is the
s.o.s. of the sample means about the general mean. To
calculate these sums of squares we make use of the identity

$$S(X-\bar{X})^2 = S(X^2)-[S(X)]^2/N.$$

First we find S(X) and S(X²) for each sample. These may con-
veniently be arranged with the sample means as shown in
Table XI.
TABLE XI

    Sample    No. in sample, N_s    S(X)     S(X²)      X̄_s
    1              6                 510      43,406    85
    2              5                 440      38,770    88
    3              8                 712      63,430    89
    4              6                 498      41,374    83
    5              7                 609      52,999    87
    6              9                 774      66,592    86
    Totals        41 = N            3543     306,571    86.41 = X̄

Note that for sample 1, S(X²) is given by

    82² + 87² + 86² + 90² + 81² + 84² = 43,406,

and similarly for the other samples.
The total s.o.s. is then

    306571 − 3543²/41
    = 306571 − 306167.05
    = 403.95.
The s.o.s. of the sample means about the general mean, i.e.
$S(\bar{X}_s-\bar{X})^2$, is obtained by squaring S(X) for each sample,
dividing each square by N_s, the number in that sample, adding
the six results and subtracting [S(X)]²/N. Thus we have

    S(X̄_s − X̄)² = 510²/6 + 440²/5 + 712²/8 + 498²/6 + 609²/7 + 774²/9 − 3543²/41
                 = 306319 − 306167.05
                 = 151.95.

The s.o.s. of sample observations about sample means is
obtained by subtracting [S(X)]²/N_s from S(X²) for each sample
and adding the six results. We have, therefore,

    S(X − X̄_s)² = (43406 − 510²/6) + (38770 − 440²/5)
                 + (63430 − 712²/8) + (41374 − 498²/6)
                 + (52999 − 609²/7) + (66592 − 774²/9)
                 = 56 + 50 + 62 + 40 + 16 + 28
                 = 252.


We now have the necessary sums of squares. To obtain the
estimates of variance these have to be divided by the respec-
tive number of degrees of freedom (D.F.). The total number of
D.F. is 41 − 1 = 40. If there are P samples, the D.F. between
samples = P − 1 = 5 in this case. The D.F. from individuals in
the samples about their sample means = S(N_s − 1) = N − P = 35.
The complete analysis of variance then takes the following
tabular form:

                        s.o.s.     D.F.    Mean square
    Between samples     151.95       5       30.39
    Within samples      252         35        7.20
    Total               403.95      40       10.099


It will be noted that the first two columns of figures are
additive but not the last column, which gives the mean squares
or estimates of the parent variance. In other words, we have
actually analysed the total s.o.s. and not the variance. The
first two variances in the last column are independent esti-
mates and so may be compared. The variance calculated from
the total s.o.s. is not independent since it includes the other
two.

There are now two ways of testing the homogeneity of the
samples. First we may calculate

$$z = \tfrac{1}{2}\log_e\left[\frac{S(\bar{X}_s-\bar{X})^2}{P-1}\div\frac{S(X-\bar{X}_s)^2}{N-P}\right]$$

and consult Table V in Fisher and Yates's Tables for n₁ = 5
and n₂ = 35. In this case z = 0.7200. From the table for
n₁ = 5 and n₂ = 30 we see a value of z of 0.4648 for P = 0.05.
Since the calculated z is much bigger than the one in the tables,
it means that the data depart significantly from homogeneity.

Alternatively, and more simply, we may calculate the
variance ratio, V.R., by dividing the variance between samples
by the variance within samples and again consulting Table V
in Fisher and Yates. In this case V.R. = 30.39/7.2 = 4.22. In
the table for n₁ = 5 and n₂ = 30 we find a value for the V.R.
of 2.53 for P = 0.05, again showing that the calculated V.R.
would occur by chance very much less often than 1 in 20 times.
Hence the data depart significantly from homogeneity.
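The whole of the one-way analysis can be reproduced in a few lines. The sketch below is an illustration rather than part of the original method; the function name is hypothetical, and the second reading for tree 6 is taken as 86, an assumption that agrees with the sample total S(X) = 774 used in Table XI.

```python
import math

# Leaf lengths in millimetres for the six trees of Example 43.
# Assumption: the second value for tree 6 is 86 (this gives S(X) = 774).
trees = [
    [82, 87, 86, 90, 81, 84],
    [85, 84, 91, 92, 88],
    [92, 90, 84, 86, 88, 93, 89, 90],
    [80, 86, 87, 81, 82, 82],
    [87, 86, 88, 90, 85, 86, 87],
    [90, 86, 84, 85, 85, 86, 87, 84, 87],
]


def one_way_analysis(samples):
    """Sums of squares, D.F., variance ratio and z for a one-way layout."""
    n = sum(len(s) for s in samples)
    correction = sum(sum(s) for s in samples) ** 2 / n        # [S(X)]^2 / N
    total = sum(x * x for s in samples for x in s) - correction
    between = sum(sum(s) ** 2 / len(s) for s in samples) - correction
    within = total - between
    df_between, df_within = len(samples) - 1, n - len(samples)
    vr = (between / df_between) / (within / df_within)
    return {
        "total": total, "between": between, "within": within,
        "df": (df_between, df_within, n - 1),
        "vr": vr, "z": 0.5 * math.log(vr),
    }


result = one_way_analysis(trees)
for name in ("between", "within", "total", "vr", "z"):
    print(f"{name:8s} {result[name]:9.3f}")
```

Since z is half the natural logarithm of the variance ratio, the two tests always lead to the same decision; the tabulated 5% values 0.4648 and 2.53 correspond in the same way.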

The above example has been worked out in full to show
exactly what is being done. In practice the arithmetic may be
reduced. First we take a working mean of, say, 85 and reduce
the original data to the following:

    Tree   Length (from 85)
    1      −3, 2, 1, 5, −4, −1
    2      0, −1, 6, 7, 3
    3      7, 5, −1, 1, 3, 8, 4, 5
    4      −5, 1, 2, −4, −3, −3
    5      2, 1, 3, 5, 0, 1, 2
    6      5, 1, −1, 0, 0, 1, 2, −1, 2

From these data Table XI becomes Table XIA.

TABLE XIA

    Sample    N_s       S(X)    S(X²)    X̄_s
    1          6          0      56       0
    2          5         15      95       3
    3          8         32     190       4
    4          6        −12      64      −2
    5          7         14      44       2
    6          9          9      37       1
    Totals    41 = N     58     486       1.41 = X̄

From this, total
    s.o.s. = 486 − 58²/41 = 486 − 82.05 = 403.95.
For the sum of squares between samples we make use of the
identity

$$S(\bar{X}_s-\bar{X})^2 = S[S(X)\,\bar{X}_s]-[S(X)]^2/N.$$

In this case we have
    s.o.s. between samples
    = (0 × 0) + (15 × 3) + (32 × 4) + (−12 × −2)
      + (14 × 2) + (9 × 1) − 58 × 58/41
    = 0 + 45 + 128 + 24 + 28 + 9 − 82.05
    = 151.95.
The s.o.s. of observations about their respective sample
means need not be calculated directly but may be obtained by
subtracting the s.o.s. between samples from the total s.o.s.
This may be entered in the analysis of variance table as the
'remainder'. The table then takes this form:

                        s.o.s.     D.F.    Mean square    V.R.
    Between samples     151.95       5       30.39        4.22
    Remainder           252         35        7.20         —
    Total               403.95      40       10.099

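11.iii. Interpretation of analysis. Having decided from our
analysis that the data are not homogeneous, we have to ask
which samples differ, for instance whether the sample with the
largest mean length was taken from a different species of tree.
This may be decided by calculating the standard error of the
difference between sample means, for which we need an
estimate of the parent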
variance. If the data were homogeneous, the best estimate
would be the one based on the largest number of D.F., i.e. that
obtained from the total s.o.s. However, the data are not
homogeneous, so that the best estimate of the variance for
calculating the standard error is the 'remainder' variance,
since from this the variance due to differences between sample
means has been eliminated. This variance is 7.20, so that the
standard error of the mean of a sample of size N drawn from a
population of estimated variance 7.20 will be
$\sqrt{7.20/N} = 2.683/\sqrt{N}$.

The standard error of the difference between two means of
samples of N₁ and N₂ respectively will be

$$2.683\sqrt{\frac{1}{N_1}+\frac{1}{N_2}} \quad \text{(see Chapter v)}.$$


The difference between the means of samples 3 and 4 in
the previous example is 89 − 83 = 6 and the numbers in the
samples are 8 and 6. Hence

    S.e. of the difference
    = 2.683 √(1/8 + 1/6) = 2.683 × 0.54
    = 1.449.
    Difference/S.e. of difference
    = 6/1.449 = 4.14.


Normally we should infer from this, if we had only two
samples, that this difference between the means is definitely
significant. However, a little care is necessary in coming to
this conclusion. We have 6 samples so that we might compare
15 different pairs of means. Above we chose the biggest differ-
ence so that we ought to apply a stricter test of significance
than that of P = 0.05. It seems reasonable in this case to
demand that the probability of a difference of the observed
size arising by chance should be 1 in 20 × 15, i.e. 1 in 300. This
corresponds roughly to a case where a difference is 2.9 times
its standard error, as may be seen from the table of the pro-
bability integral. Hence we may take it in this case that any
difference between means that is 2.9 times its standard error,
or greater, is significant. Applying this test to all possible
pairs in the example we obtain the results shown in the table
below.

    Samples    Difference    S.e. of diff.    Diff./S.e.
    1 and 2        3            1.625            1.8
    1 and 3        4            1.449            2.8
    1 and 4        2            1.549            1.3
    1 and 5        2            1.493            1.3
    1 and 6        1            1.414            0.7
    2 and 3        1            1.530            0.7
    2 and 4        5            1.625            3.1    Significant
    2 and 5        1            1.571            0.6
    2 and 6        2            1.497            1.3
    3 and 4        6            1.449            4.1    Significant
    3 and 5        2            1.388            1.4
    3 and 6        3            1.304            2.3
    4 and 5        4            1.493            2.7
    4 and 6        3            1.414            2.1
    5 and 6        1            1.352            0.7

The order of the samples arranged according to mean value
is 3, 2, 5, 6, 1, 4. From the table it may be seen that only
samples 3 and 4 and 2 and 4 are significantly different. We
should suspect, therefore, that samples 2 and 3 came from a
different species of tree from sample 4 but which of the other
samples came from which species cannot be deduced from the
present data. There is an indication that with larger samples
the differences between the means of 3 and 6 and 3 and 1 might
appear as significant, and also that between 4 and 5. We
might suspect then that samples 3, 2 and 5 came from one
species of tree and samples 6, 1 and 4 from another, but we
could not prove this from the present data. For this we should
need larger samples or some other means of identification,
such as shape.
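The fifteen comparisons are tedious by hand but simple to mechanise. The following sketch takes the sample means, the sample sizes and the remainder variance of 7.20 from Example 43; the names used are illustrative, and the criterion of 2.9 standard errors is the one argued for above.

```python
import math
from itertools import combinations

means = {1: 85, 2: 88, 3: 89, 4: 83, 5: 87, 6: 86}   # sample means
sizes = {1: 6, 2: 5, 3: 8, 4: 6, 5: 7, 6: 9}         # numbers in samples
remainder_variance = 7.20
criterion = 2.9   # difference / s.e. needed for significance (1 in 300)


def compare(a, b):
    """Difference between two sample means, its s.e., and their ratio."""
    diff = abs(means[a] - means[b])
    se = math.sqrt(remainder_variance * (1 / sizes[a] + 1 / sizes[b]))
    return diff, se, diff / se


significant = []
for a, b in combinations(means, 2):
    diff, se, ratio = compare(a, b)
    flag = "Significant" if ratio >= criterion else ""
    if flag:
        significant.append((a, b))
    print(f"{a} and {b}   {diff:2d}   {se:.3f}   {ratio:.1f}   {flag}")
```

Changing `criterion` shows how sensitive the conclusions are to the level of significance demanded when many pairs are compared.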


11.iv. Two-way classification. In the last example,
the data all belonged to a single family, leaf-lengths. It
frequently happens that a set of observations may be regarded
as belonging to two families at the same time.

Example 44. In an experiment on the effects of tem-
perature conditions on human performance, 8 practised
subjects were given a sensori-motor test in each of 4 tempera-
ture conditions. Since the subjects were well practised, the
order in which the tests were done was unimportant. The tests
were randomised amongst the subjects, so that for each con-
dition there were equal numbers of first testing, second testing,
third testing and fourth testing. The scores in the tests are
shown below.

TABLE XII

                                  Subjects (A)
                        1    2    3    4    5    6    7    8    Totals
    Temperatures   1   76   80   79   90   85  101   94   83     688
    (B)            2   75   81   77   90   86   98   93   85     685
                   3   76   78   76   91   82   98   92   83     676
                   4   68   75   72   85   82   90   82   77     631
    Totals            295  314  304  356  335  387  361  328    2680


Call the number of subjects N_a and the number of conditions
N_b. In such a case as this the total s.o.s. may be analysed into
three parts:

(i) the s.o.s. of A means about the general mean, i.e.
$S(\bar{X}_a-\bar{X})^2$, with N_a − 1 D.F.;

(ii) the s.o.s. of B means about the general mean, i.e.
$S(\bar{X}_b-\bar{X})^2$, with N_b − 1 D.F.;

(iii) a remainder obtained by subtracting (i) and (ii) from
the total s.o.s. This has (N_a − 1)(N_b − 1) D.F.

The total s.o.s., with N_a N_b − 1 D.F., is calculated as in the
previous example by summing the squares of all the readings
and subtracting [S(X)]²/N. For convenience and to reduce the
arithmetic the data in Table XII may be written down from
a working mean of 84 (since 2680/32 = 83.75). This yields
Table XIIA.

TABLE XIIA

                                  Subjects (A)
                        1    2    3    4    5    6    7    8    Totals
    Temperatures   1   −8   −4   −5    6    1   17   10   −1      16
    (B)            2   −9   −3   −7    6    2   14    9    1      13
                   3   −8   −6   −8    7   −2   14    8   −1       4
                   4  −16   −9  −12    1   −2    6   −2   −7     −41
    Totals            −41  −22  −32   20   −1   51   25   −8      −8

The total s.o.s.
    = (−8)² + (−4)² + (−5)² + ... + (−2)² + (−7)² − (−8)²/32
    = 2042 − 2 = 2040.

Great care should be taken with this part of the calculation
as there is no simple check on the arithmetic.

To obtain $S(\bar{X}_a-\bar{X})^2$, square and sum the 8 totals for the
A grouping, divide by 4 (since there are 4 readings in each),
and subtract [S(X)]²/N.
Hence

    S(X̄_a − X̄)² = ¼[(−41)² + (−22)² + (−32)² + 20² + (−1)²
                     + 51² + 25² + (−8)²] − 2
                 = 6880/4 − 2
                 = 1718.

In a similar way, $S(\bar{X}_b-\bar{X})^2$ is obtained by squaring and
summing the 4 totals for the B groups, dividing by 8 and sub-
tracting [S(X)]²/N. Hence

    S(X̄_b − X̄)² = ⅛[16² + 13² + 4² + (−41)²] − 2
                 = 2122/8 − 2
                 = 263.25.

We may now construct the analysis table as below, obtaining
the 'remainder' s.o.s. by subtraction.

                             s.o.s.     D.F.    Mean square    V.R.
    Between subjects         1718         7       245.4        87.7
    Between temperatures      263.25      3        87.75       31.4
    Remainder                  58.75     21         2.80        —
    Total                    2040        31        65.8

In this table the first three estimates of the parent variance,
given in the mean square column, are independent. The V.R.
is obtained in each case by calculating the ratio of the first
two variances to the remainder variance. Reference to Fisher
and Yates's Table V shows that both these ratios are highly
significant, so that we reject the hypothesis that the data
are homogeneous. The variance between subjects is very large,
showing that there are highly significant differences in ability
between the subjects, but we are not interested in this. What
does interest us is the effect of temperature conditions. The
means for these conditions, obtained by dividing the B totals
in Table XII by 8, are 86, 85.625, 84.5 and 78.875. Since the
data are not homogeneous we take the remainder variance as
the best estimate of the parent variance, so that the standard
error of a mean of N individuals drawn at random will be
$\sqrt{2.80/N} = 1.67/\sqrt{N}$.
Now we have 6 possible pairs of means to compare, hence it
is reasonable to use a level of significance equal to 1 in 20 × 6,
i.e. 1 in 120. This corresponds to a ratio between a difference
and its standard error of about 2.6. Since there are 8 readings
in each temperature mean, the standard error of the difference
between the means of two samples is $1.67\sqrt{\tfrac{1}{8}+\tfrac{1}{8}} = 0.84$. Hence
means differing by more than 2.6 × 0.84 = 2.2 will be
significantly different,
i.e. the mean of the fourth temperature condition is signi-
ficantly different from the means of the other three conditions,
but these three do not differ significantly from one another.
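Because there is no simple check on this arithmetic, it is worth recomputing each sum of squares directly from the scores. The sketch below does so; the function name is illustrative, and the scores for temperature conditions 3 and 4 are an assumption, being the values consistent with the row totals 676 and 631, the column totals of Table XII and the sum of squares 2042 about the working mean.

```python
# Scores of Example 44: rows are temperature conditions (B),
# columns are subjects (A). Rows 3 and 4 are reconstructed values.
scores = [
    [76, 80, 79, 90, 85, 101, 94, 83],
    [75, 81, 77, 90, 86, 98, 93, 85],
    [76, 78, 76, 91, 82, 98, 92, 83],
    [68, 75, 72, 85, 82, 90, 82, 77],
]


def two_way_analysis(table):
    """Total, between-rows, between-columns and remainder sums of squares."""
    rows, cols = len(table), len(table[0])
    correction = sum(map(sum, table)) ** 2 / (rows * cols)   # [S(X)]^2 / N
    total = sum(x * x for row in table for x in row) - correction
    between_rows = sum(sum(row) ** 2 for row in table) / cols - correction
    between_cols = sum(sum(col) ** 2 for col in zip(*table)) / rows - correction
    remainder = total - between_rows - between_cols
    return total, between_rows, between_cols, remainder


total, temperatures, subjects, remainder = two_way_analysis(scores)
df_remainder = (len(scores) - 1) * (len(scores[0]) - 1)
print(f"total {total}, subjects {subjects}, "
      f"temperatures {temperatures}, remainder {remainder}")
print(f"remainder mean square {remainder / df_remainder:.2f}")
```

With these scores the between-temperatures sum of squares evaluates to 263.25 and the remainder to 58.75 on 21 D.F.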


11.v. Three-way classification. In the previous ex-
ample, the remainder s.o.s., which would be symbolised as
$S(X-\bar{X}_a-\bar{X}_b+\bar{X})^2$, is sometimes referred to as an inter-
action. When we have a set of data which can be classified in
three ways, A, B and C, there will be three such terms, one for
the interaction of A and B, one for A and C and the third for
B and C. The total s.o.s. in such a case may therefore be
analysed into seven parts. First there will be three parts given
by $S(\bar{X}_a-\bar{X})^2$, $S(\bar{X}_b-\bar{X})^2$ and $S(\bar{X}_c-\bar{X})^2$; these show the
effects of A, B and C separately and are usually known as the
main effects. Next will be the three interaction terms men-
tioned above. Interactions involving two main effects are
known as first-order interactions. Finally, the seventh part in
the analysis is a remainder term. This term is in fact the inter-
action of all three main effects and is known as a second-order
interaction. The number of degrees of freedom involved in
interactions is always the product of the D.F. of the component
main effects, so that if the numbers of groupings in the classes
A, B and C are p, q and r respectively, the D.F. for interaction
AB will be (p − 1)(q − 1), for AC (p − 1)(r − 1) and for BC
(q − 1)(r − 1), whilst for the second-order interaction ABC the
D.F. will be (p − 1)(q − 1)(r − 1). The total D.F. are (pqr − 1)
and the student may easily verify by algebra that this is equal
to the sum of the D.F. of the seven parts in the analysis.
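The verification suggested above can also be made numerically. The short sketch below (the function name is illustrative) lists the D.F. of the seven parts and confirms that they add up to pqr − 1, both for the illustrative values p = 3, q = 3, r = 6 and for a range of other group numbers.

```python
def three_way_df(p, q, r):
    """D.F. for each of the seven parts of a p x q x r classification."""
    return {
        "A": p - 1,
        "B": q - 1,
        "C": r - 1,
        "AB": (p - 1) * (q - 1),
        "AC": (p - 1) * (r - 1),
        "BC": (q - 1) * (r - 1),
        "ABC": (p - 1) * (q - 1) * (r - 1),
    }


parts = three_way_df(3, 3, 6)
print(parts, "sum =", sum(parts.values()))

# The identity holds for any numbers of groupings, not only these.
all_agree = all(
    sum(three_way_df(p, q, r).values()) == p * q * r - 1
    for p in range(2, 7) for q in range(2, 7) for r in range(2, 7)
)
print("identity holds for all sizes tried:", all_agree)
```

The algebraic reason is that pqr = [(p − 1) + 1][(q − 1) + 1][(r − 1) + 1], which expands into the seven products above plus 1.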

The calculation of the various sums of squares is similar
in nature to that of the previous example but a little more
complicated. The method is fully illustrated in the following
example.

Example 45. In a certain psychological experiment the
apparatus consisted of a dial having a rotating needle and a
number of graduations round the circumference. The needle
could be rotated at 3 different speeds and the dial displayed
under 3 different intensities of illumination. Subjects in this
experiment were required to make a certain reaction each
time the needle reached a graduation on the dial. Since there
are nine combinations of speed and illumination, each subject
had 9 tasks to perform. The order of performance was ran-
domised for different subjects. Table XIII below gives the
number of correct reactions made in each of the 9 tasks by
6 different subjects. The object of the experiment was to dis-
cover the relative effects of speed and intensity of illumination
on performance. (Note. This is a greatly simplified version
of an actual experiment and is given here solely for the purpose
of illustrating the analysis of variance with such data.)


TABLE XIII

                              Illuminations (A)
                         1             2             3
                     Speeds (B)    Speeds (B)    Speeds (B)
                     1   2   3     1   2   3     1   2   3
    Subjects (C) 1  45  38  29    43  35  26    35  29  18
                 2  38  33  20    40  32  21    34  25  19
                 3  39  32  21    41  29  25    29  24  16
                 4  43  37  24    39  30  25    30  27  14
                 5  40  36  28    42  31  24    31  26  17
                 6  40  35  25    40  32  22    32  26  16

This table gives a three-way classification as each entry
belongs to a certain subject (C) at a certain speed (B) under a
certain intensity of illumination (A). The first step in the
analysis is to find a convenient working mean and rewrite
Table XIII from this mean, inserting totals at the same time.
Since the range of readings is from 14 to 45, a convenient
working mean is 30. Rewriting the readings about this mean
we get Table XIIIA.

TABLE XIIIA

                                       Illuminations (A)
                           1                    2                    3
                       Speeds (B)           Speeds (B)           Speeds (B)
                     1    2    3  Total   1    2    3  Total   1    2    3  Total   Total
    Subjects (C) 1  15    8   −1    22   13    5   −4    14    5   −1  −12    −8      28
                 2   8    3  −10     1   10    2   −9     3    4   −5  −11   −12      −8
                 3   9    2   −9     2   11   −1   −5     5   −1   −6  −14   −21     −14
                 4  13    7   −6    14    9    0   −5     4    0   −3  −16   −19      −1
                 5  10    6   −2    14   12    1   −6     7    1   −4  −13   −16       5
                 6  10    5   −5    10   10    2   −8     4    2   −4  −14   −16      −2
    Totals          65   31  −33    63   65    9  −37    37   11  −23  −80   −92       8

The total s.o.s. may be obtained from this table in the usual
way, i.e. square each single entry (not the totals), add and
subtract [S(X)]²/N. In this case

    total s.o.s. = 15² + 8² + (−1)² + ...
                   + 2² + (−4)² + (−14)² − 8²/54
                 = 3402 − 1.185
                 = 3400.815.
Main effects
(1) The totals for the A grouping are 63, 37 and −92, each
being the sum of 18 readings.

Hence
    S(X̄_a − X̄)² = [63² + 37² + (−92)²]/18 − 8²/54
                 = 13802/18 − 1.185
                 = 765.593.
(2) The totals for B are obtained by adding together the
three totals for B₁, B₂ and B₃ separately, i.e.
    total B₁ = 65 + 65 + 11 = 141,
    total B₂ = 31 + 9 − 23 = 17,
    total B₃ = −33 − 37 − 80 = −150.
Hence
    S(X̄_b − X̄)² = [141² + 17² + (−150)²]/18 − 8²/54
                 = 42670/18 − 1.185
                 = 2369.370.

(3) The C totals are given at the right of Table XIIIA and
each contains 9 readings.

Hence
    S(X̄_c − X̄)²
    = [28² + (−8)² + (−14)² + (−1)² + 5² + (−2)²]/9 − 8²/54
    = 1074/9 − 1.185
    = 118.148.

First-order interactions

The interactions AB, AC and BC are best found by first
extracting three two-way tables from Table XIIIA.

(4) That for AB is shown in Table XIIIB.

TABLE XIIIB

               A₁     A₂     A₃    Totals
    B₁         65     65     11      141
    B₂         31      9    −23       17
    B₃        −33    −37    −80     −150
    Totals     63     37    −92        8

Now reference to the previous example shows that, to
calculate the s.o.s. for the remainder or interaction in a two-
way table, we calculated the total s.o.s. for the table and
subtracted from it the sum of the s.o.s. of the main effects.
Here we proceed likewise, bearing in mind that each reading in
Table XIIIB is the sum of 6 readings. Hence for Table XIIIB,

    the total s.o.s. = [65² + 65² + 11² + 31² + 9² + (−23)²
                        + (−33)² + (−37)² + (−80)²]/6 − 8²/54
                     = 19000/6 − 1.185
                     = 3165.482.
We have already calculated the s.o.s. for the main effects A
and B, which are involved in this interaction, hence

    interaction s.o.s. for AB
    = 3165.482 − 765.593 − 2369.370
    = 30.519.

(5) Next we extract the two-way table for interaction AC,
each reading of which will be the sum of 3 readings. This is
shown in Table XIIIC.

TABLE XIIIC

               A₁     A₂     A₃    Totals
    C₁         22     14     −8       28
    C₂          1      3    −12       −8
    C₃          2      5    −21      −14
    C₄         14      4    −19       −1
    C₅         14      7    −16        5
    C₆         10      4    −16       −2
    Totals     63     37    −92        8

From this table, interaction s.o.s. for AC is
    [22² + 14² + (−8)² + ... + 10² + 4² + (−16)²]/3 − 8²/54
      − 765.593 − 118.148
    = 2814/3 − 1.185 − 765.593 − 118.148
    = 53.074.


(6) Finally we have, for interaction BC (Table XIIID):

TABLE XIIID

               B₁     B₂     B₃    Totals
    C₁         33     12    −17       28
    C₂         22      0    −30       −8
    C₃         19     −5    −28      −14
    C₄         22      4    −27       −1
    C₅         23      3    −21        5
    C₆         22      3    −27       −2
    Totals    141     17   −150        8

This table is obtained by adding the three B₁ entries in
Table XIIIA for C₁, i.e. 15 + 13 + 5 = 33, and so on.
From Table XIIID, interaction s.o.s. for BC is

    [33² + 12² + (−17)² + ... + 22² + 3² + (−27)²]/3 − 8²/54
      − 2369.370 − 118.148
    = 7506/3 − 1.185 − 2369.370 − 118.148
    = 13.297.

We may now complete our analysis table as under, obtaining
the remainder term by adding the s.o.s. for the three main
effects and the first-order interactions and subtracting from
the total s.o.s. The remainder variance is as usual used as the
denominator of the V.R.'s.

                                   s.o.s.     D.F.   Mean squares    V.R.
    Between illuminations (A)      765.593      2      382.80       150.7
    Between speeds (B)            2369.370      2     1184.68       466.3
    Between subjects (C)           118.148      5       23.63         9.3
    Interaction AB                  30.519      4        7.63         3.0
    Interaction AC                  53.074     10        5.31         2.09
    Interaction BC                  13.297     10        1.33         0.52
    Remainder                       50.814     20        2.54          —
    Total                         3400.815     53
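The seven sums of squares can be recomputed directly from the readings as a check on the hand calculation. In the sketch below the readings are an assumption to the extent that they are the values consistent with the two-way totals of Tables XIIIB, XIIIC and XIIID; the helper `grouped_sos` and the other names are illustrative and not part of the original method.

```python
from collections import defaultdict

# counts[subject][illumination][speed]: correct reactions in Example 45.
counts = [
    [[45, 38, 29], [43, 35, 26], [35, 29, 18]],
    [[38, 33, 20], [40, 32, 21], [34, 25, 19]],
    [[39, 32, 21], [41, 29, 25], [29, 24, 16]],
    [[43, 37, 24], [39, 30, 25], [30, 27, 14]],
    [[40, 36, 28], [42, 31, 24], [31, 26, 17]],
    [[40, 35, 25], [40, 32, 22], [32, 26, 16]],
]

# One record per reading, keyed by its (A, B, C) classification.
cells = [((a, b, c), x)
         for c, subject in enumerate(counts)
         for a, illumination in enumerate(subject)
         for b, x in enumerate(illumination)]
n = len(cells)
correction = sum(x for _, x in cells) ** 2 / n          # [S(X)]^2 / N


def grouped_sos(axes):
    """S.o.s. of the table of totals formed by keeping only `axes`."""
    totals = defaultdict(float)
    for key, x in cells:
        totals[tuple(key[i] for i in axes)] += x
    readings_per_total = n / len(totals)
    return sum(t * t for t in totals.values()) / readings_per_total - correction


A, B, C = 0, 1, 2
total = sum(x * x for _, x in cells) - correction
sos = {"A": grouped_sos([A]), "B": grouped_sos([B]), "C": grouped_sos([C])}
sos["AB"] = grouped_sos([A, B]) - sos["A"] - sos["B"]
sos["AC"] = grouped_sos([A, C]) - sos["A"] - sos["C"]
sos["BC"] = grouped_sos([B, C]) - sos["B"] - sos["C"]
sos["Remainder"] = total - sum(sos.values())

for name, value in sos.items():
    print(f"{name:10s} {value:10.3f}")
print(f"{'Total':10s} {total:10.3f}")
```

Working from the raw readings rather than from a working mean makes no difference to the results, since every term is a sum of squares about a mean.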


11.vi. Interpretation of analysis. The most striking feature
of this analysis is the very large variance ratios for speeds
and illuminations. The data obviously depart significantly
from homogeneity; reference to tables is unnecessary with
ratios of this size. The experiment shows that both speed
and illumination have a very marked effect on performance,
especially speed. Differences between subjects are also signi-
ficant but are of no particular interest. Reference to Fisher and
Yates's table shows that the interaction AB is just significant
at the 5% level, but neither of the other first-order inter-
actions is significant. The meaning of this significant inter-
action AB is that the effects of speed and illumination are not
entirely independent.

Before, however, we accept this interaction as important
we may apply a further test to it. Since the interactions AC
and BC are not significant they may be combined with the
'remainder' to give us a larger number of D.F. for estimating
the parent variance. This is done by adding the s.o.s. for the
interactions AC and BC and remainder, and also their D.F.,
which gives us a new remainder variance, estimated with
40 D.F., which may be used as a denominator for the V.R.'s.
In general, the more degrees of freedom there are available
for estimating a variance, the more reliable is the estimate.
The analysis table obtained by this means is given below.

                                   s.o.s.     D.F.   Mean squares    V.R.
    Between illuminations (A)      765.593      2      382.80       130.7
    Between speeds (B)            2369.370      2     1184.68       404.4
    Between subjects (C)           118.148      5       23.63         8.1
    Interaction (AB)                30.519      4        7.63         2.60
    Remainder                      117.185     40        2.93          —
    Total                         3400.815     53
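The pooling of the non-significant interactions with the remainder amounts to adding their sums of squares and their degrees of freedom. A minimal sketch, using the figures of the first analysis table (the variable names are illustrative):

```python
# s.o.s. and D.F. of the terms to be pooled into the new remainder.
pooled_terms = {
    "Interaction AC": (53.074, 10),
    "Interaction BC": (13.297, 10),
    "Remainder": (50.814, 20),
}

pooled_sos = sum(s for s, _ in pooled_terms.values())
pooled_df = sum(d for _, d in pooled_terms.values())
pooled_variance = pooled_sos / pooled_df

# Variance ratio for interaction AB against the new remainder variance.
ab_mean_square = 30.519 / 4
vr_ab = ab_mean_square / pooled_variance

print(f"pooled s.o.s. = {pooled_sos:.3f} on {pooled_df} D.F., "
      f"variance = {pooled_variance:.2f}, V.R. for AB = {vr_ab:.2f}")
```

Pooling is only reasonable for terms already judged not significant; a term that carries a real effect would inflate the remainder variance and hide other effects.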


Consulting Fisher and Yates's Table V with n₁ = 4 and
n₂ = 40, we find a value for the V.R. of 2.61 for P = 0.05. This
means that now we have employed a better estimate of the
parent variance based on a larger number of D.F., the inter-
action AB is just on the borderline of significance and so
cannot be regarded as having an important effect on perform-
ance. The final conclusions from the analysis would therefore
be that speed has a very marked effect on performance and
intensity of illumination also has a marked effect, though not
so marked as that of speed, within the range of this particular
experiment. The practical value of an experiment such as
this is to show that in a human activity requiring reactions
of the sort exemplified it is most important to control the
speed factor for efficient performance, and also, though in a
less degree, to control the factor of illumination. The optimum
values of speed and illumination necessary for maximum
efficiency cannot be deduced from the above analysis and
would have to be the subject of further research.


11.vii. n-way classification. Analysis on precisely
similar lines to those exemplified above may be made with
data which may be classified in 4, 5 or more ways. The amount
of arithmetic becomes progressively greater, however. For
instance, with a 4-way classification, A, B, C and D, there will
be 6 first-order interactions, AB, AC, AD, BC, BD and CD,
and 4 second-order interactions, ABC, ABD, ACD and BCD, and
the remainder will be a third-order interaction, ABCD.
Similarly, with a 5-way classification there will be 10 first-
order, 10 second-order and 5 third-order interactions, and the
remainder will be a fourth-order interaction.
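The interactions of each order are simply the combinations of the main effects taken two, three or more at a time, so they can be enumerated mechanically. In the sketch below (the function name is illustrative) an interaction of order k is taken to involve k + 1 main effects, following the usage of Section 11.v.

```python
from itertools import combinations


def interactions(factors):
    """Map each interaction order to the list of factor combinations."""
    return {
        size - 1: ["".join(group) for group in combinations(factors, size)]
        for size in range(2, len(factors) + 1)
    }


four_way = interactions("ABCD")
five_way = interactions("ABCDE")

for order, terms in four_way.items():
    print(f"order {order}: {len(terms)} term(s): {', '.join(terms)}")
print({order: len(terms) for order, terms in five_way.items()})
```

The highest-order term in each case is the single interaction of all the main effects, which serves as the remainder.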
Although the analysis of variance in these cases presents no
added difficulties, apart from the greatly increased arith- 
metical labour, in practice, particularly in the psychological 
field, the design of an experiment becomes increasingly 
difficult. If the main effects are to be directly comparable, it 
is essential that each reading shall be capable of being classified 
under each separate class heading. This increases the number 
of tests each subject has to perform and may introduce 
practice effects and fatigue effects which it may not be possible 
to isolate. 

In the analysis of the previous example the implicit assumption
was made that the task required of the subjects was so
simple that neither practice nor fatigue would affect performance.
This is by no means always the case and very great
care has to be exercised in designing psychological experiments
where variables such as these may enter. Such devices as
taking the order in which the tests are done as one of the
classes of variates, or taking matched groups of subjects, each
of whom does only a fraction of the total tests, are useful on
occasion. However, the data in such cases need handling
with great care and the elementary student is strongly advised
to seek expert help both in designing an experiment so that it
will yield data of an appropriate type and also in analysing
the results. The normal tests of homogeneity are invalidated
if the assumptions, previously mentioned, on which such tests
are based are not warranted. It is true that the analysis of
variance may be carried out where the data are not normally
distributed, where numbers in sub-groups are unequal and
even when parts of the data are missing, but homogeneity
tests in such cases are complicated and laborious and are in
any case beyond the bounds of this simple introduction.


EXERCISE ON CHAPTER XI 


38. Re-work and check the arithmetic of Examples 43, 44 and 45. 




Answers to Exercises 


1. Test     1-25     26-50    51-75    76-100
   A       43.20    44.68    44.52    44.98
   B      175.76   176.72   193.28   193.72
   C       28.76    22.92    23.12    20.64
   D       28.64    33.56    29.96    30.00
   E      125.00   120.80   124.40   130.32
   F       27.48    27.20    26.68    27.72
   G       35.08    38.40    36.16    37.52

2. (a) 27.34. (b) 27.20.

3. Test     (a)      (b)     Discrepancy
   A       44.14    44.17    0.03
   B      185.40   184.87    0.53
   C       23.78    23.86    0.08
   D       30.38    30.54    0.16
   E      125.20   125.13    0.07
   F       27.18    27.27    0.09
   G       36.75    36.79    0.04

4. Test     (a)      (b)     Test     (a)          (b)
   A       45.5     44.0     E       1930 [sic]   130.0
   B      167.0    185.5     F       28.0          25.5
   C       25.5     22.0     G       37.5          39.0
   D       30.5     27.5

5. Test     (a)      (b)     Test     (a)      (b)
   A       48.62    43.20    E      123.20   135.28
   B      148.52   169.50    F       29.32    22.10
   C       24.82    22.24    G       39.02    43.32
   D       29.30    22.54

6. Test    Lower      Upper      Semi-interquartile
           quartile   quartile   range
   A        38.0       50.0        6.0
   B       157.0      203.0       23.0
   C        15.0       31.0        8.0
   D        19.5       40.0       10.25
   E       107.5      142.0       17.25
   F        23.0       32.0        4.5
   G        29.0       42.0        6.5

7. Test     (a)      (b)     Test     (a)      (b)
   A        7.66     9.76    E       17.34    17.32
   B       24.08    29.98    F        4.74     5.52
   C       10.34     8.80    G        7.18     6.88
   D        9.90    12.50

8. (a) 20.57. (b) 15.05.

9. Test     (a)      (b)      (c)      (d)
   A       11.06     8.18    13.77    10.81
   C       12.87    11.09    11.93    10.06
   D       12.03    12.33    14.78    15.05
   F        5.99     6.41     5.90     7.35
   G        9.36     8.43     8.17     8.58

10. Test    S.D.     V.      Test    S.D.     V.
    A      11.24    25.46    E      21.13    16.88
    B      37.47    20.21    F       6.46    23.77
    C      11.80    49.62    G       8.70    23.67
    D      13.76    45.29


11. $\beta_1$ = 0.000002; $\beta_2$ = 3.11,
    $\gamma_1$ = 0.0014; S.e. = 0.245,
    $\gamma_2$ = 0.11; S.e. = 0.490.
Hence the distribution does not depart significantly from normality.

12. $\beta_1$ = 0.0520; $\beta_2$ = 2.44,
    $\gamma_1$ = 0.23; S.e. = 0.246,
    $\gamma_2$ = -0.56; S.e. = 0.490.
Hence the distribution does not depart significantly from normality.


13. 6. 

14. 4P = 0.0668. There is therefore no evidence that people were 
not guessing. 

15. Mean difference = 0.12, t = 0.037. Hence the mean difference
does not depart significantly from zero.


16. Standard _ Standard 
; ` error Difference error Difference 
(а) 1-421 7-39: . | (а) 1-520 3-20 
(b) 1:813 rit (e) 1-628 6:37 
(e) 1:345 i (f) 1:084 1 9-57 


All these differences are significant.

17. (a) t = 0.52; difference not significant.
    (b) t = 1.70; difference not significant.
    (c) t = 1.40; difference not significant.
    (d) t = 1.29; difference not significant.
    (e) t = 1.94; difference not significant.
    (f) t = 4.42; difference significant.
    (g) t = 3.53; difference significant.
    (h) t = 4.25; difference significant.

18. V.R. = 3.80. This is less than the value of 4.21 approx. given
in Fisher and Yates, hence the conclusion is not invalidated.

19. Diff. between proportions = 0.19. S.e. of diff. = 0.0482. Diff.
is significant, i.e. influenza is definitely more prevalent in the first
town.

20. (a) r = 0.215. (c) r = 0.088.
    (b) r = 0.246. (d) r = 0.266.
22.       A        B        C        D        E        F        G
    A     -      0.372   -0.268   -0.262   -0.075   -0.395   -0.158
    B   0.372      -     -0.170   -0.068    0.105   -0.057   -0.076
    C  -0.268   -0.170      -      0.151    0.236    0.309    0.151
    D  -0.262   -0.068    0.151      -      0.032    0.388    0.105
    E  -0.075    0.105    0.236    0.032      -      0.338    0.252
    F  -0.395   -0.057    0.309    0.388    0.338      -      0.289
    G  -0.158   -0.076    0.151    0.105    0.252    0.289      -

This table illustrates a method of tabulating the coefficients of
correlation between each pair of a set of tests. For example, the
coefficient of correlation between test C and test F is given in the
row headed C under the column headed F, or in the row headed F under
the column headed C, and is seen to be 0.309.


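23. Correlation between F and A; S.e. = 0.0844,
    Correlation between F and B; S.e. = 0.0997,
    Correlation between F and C; S.e. = 0.0905,
    Correlation between F and D; S.e. = 0.0849,
    Correlation between F and E; S.e. = 0.0886,
    Correlation between F and G; S.e. = 0.0916.
The coefficient differs significantly from zero in every case except that
between F and B.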


24. ть is significantly different from rpp but not from any of the
others specified.


25. rp; = 0:363, 


Top. p = 0:275, 
Тра в = 0:224, 
Tgp,g = 0:286, 


Tor n = 0-259. 
26. (a) Between H and C (1-25), $\rho$ = 0.026,
                H and D (1-25), $\rho$ = 0.141.

    (b) Between C and D (1-25), $\rho$ = 0.171.
        From Exercise 20 (a), r = 0.215.


27. 7 = 0-440. 


28. Between H and C (1-25), $\tau$ = -0.003,
            H and D (1-25), $\tau$ = 0.128,
            C and D (1-25), $\tau$ = 0.117.


29. (a) For the regression of G on F, z = 0.3387; hence the regression
is not linear. (Alternatively, the variance ratio = 1.97. That
in the Fisher and Yates's 5% point table for $n_1$ = 12 and $n_2$ = 60 is
1.92; hence the regression departs from linearity.)
For the regression of F on G, the variance due to deviations from
linearity is less than that within arrays; hence the regression is linear.


(b) 7а = 0-530, 72, = 0-394. 


30. Between    $\chi^2$      C       P
    A and I    6.821    0.253   0.339
    B and I    3.392    0.181   0.757
    C and I    8.987    0.287   0.174
    D and I   13.835    0.349   0.032      Significant
    E and I   16.432    0.376   0.012      Significant
    F and I   50.085    0.578   0.000001   Significant
    G and I   16.352    0.375   0.012      Significant

31. Between    $\chi^2$      P
    A and I    0.62     0.431
    B and I    0.47     0.493
    C and I    2.52     0.112
    D and I    5.70     0.017       Significant
    E and I   10.38     0.0013      Significant
    F and I   38.30     0.0000001   Significant
    G and I    8.20     0.004       Significant
    H and I    5711     0.016       Significant


32. $\chi^2$ = 11.58. P = 0.009, hence the association is significant.

33. $\chi^2$ = 160.0. For this P is exceedingly small. Hence the
hypothesis fits the observed data very badly and we conclude that the
observer is unreliable. Note that he has an undue tendency to record
readings ending in 0 or 5.


Qe + 1:585. Mean square for linear regression is 408-845. 


34. y= 3-1 : Sage: 
tures from linear regression is 0-255, hence the fit is 


that for Чераг 
exceedingly good. 
35. (a) d = 6. The triads are ABEA, ACEA, ADEA, BCFB,
        BDFB, BEFB.

    (b) $\zeta$ = 0.571.

    (c) Order is GABCDEF.

    (d) The subject is inconsistent in his judgments. This may be
        because his ability is poor or because the task is too
        difficult. This could be decided only by further experiments.

36. (a) u = 0.556. $\sqrt{2\chi^2} - \sqrt{2\nu - 1}$ = 7.05, hence u is significant.
    (b) We should conclude that G, A and F are quite distinct in
        the quality judged, but that there is probably no real
        difference between B, C, D and E.

37. W = 0.853. z = 1.4284, hence W is significant.
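
Some of the answers above can be cross-checked mechanically. The sketch below is a check under stated assumptions rather than the book's own working: it assumes the large-sample standard error of r is (1 - r²)/√N, that the contingency coefficient is C = √(χ²/(N + χ²)), and that N = 100 (the number of subjects in Appendix E). With these assumptions it reproduces the standard errors of answer 23 and the C column of answer 30. The function names are illustrative.

```python
from math import sqrt


def se_of_r(r, n):
    """Large-sample standard error of a correlation coefficient (assumed form)."""
    return (1 - r ** 2) / sqrt(n)


def contingency_coefficient(chi_sq, n):
    """Coefficient of contingency C from chi-squared and sample size (assumed form)."""
    return sqrt(chi_sq / (n + chi_sq))


N = 100  # subjects in Appendix E

# Correlations of test F with the other tests, taken from answer 22.
f_correlations = {"A": -0.395, "B": -0.057, "C": 0.309,
                  "D": 0.388, "E": 0.338, "G": 0.289}

# Chi-squared values from answer 30.
chi_squared = {"A": 6.821, "B": 3.392, "C": 8.987, "D": 13.835,
               "E": 16.432, "F": 50.085, "G": 16.352}

if __name__ == "__main__":
    for test, r in f_correlations.items():
        print(f"F and {test}: S.e. = {se_of_r(r, N):.4f}")
    for test, x2 in chi_squared.items():
        print(f"{test} and I: C = {contingency_coefficient(x2, N):.3f}")
```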


ж 


Table of $\sqrt{n_1 n_2 (n_1 + n_2 - 2)/(n_1 + n_2)}$


п 


12 
13 


14 


15 


16 


17 


18 




19 


9:98 
10-49 
10:98 
11-45 
11-90 


12.34 
12-77 
13-18 
13:58 
13:97 


14-35. 
14.72 
15-08 
15-43 
15-78 


16-12 
16:45 
16:77 
17:09 
17:41 


17.72 
18:02 
18:32 
18:61 
18:90 


19:19 
19:47 
19-75 
20-02 
20:29 


20:56 
20:82 
21:08 
21:34 
21:60 


21:85 
22-10 
22:34 
22-59 
22.83 


23:06 


10-89 
11:45 
11-98 
12-49 
12-98 


13-46 
13:92 
14:36 
14-80 
15:22 


15:63 
16:03 
16:42 
16:80 
17-18 


17.55 
17:91 
18:26 
18:61 
18:95 


19:28 
19:66 
19:94 
20:26 
20:57 


11:33 
11:90 
12-45 
12-98 
13-49 


13:98 
14-46 
14:92 


11:75 
12:34 
12:91 
13-46 
13-98 


14-49 
14-98 
15-46 
15:93 
16:38. 


16:82 
17:25 
17:67 
18:08 
18-48 


18:87 
19:26 
19:64 
20:01 
20-38 


20-74 
21:09 


) | 2144 


21-78 
22-12 


22-45 
22.78 
23-10 
23-42 
23-73 


24-05 
24-35 
24-66 
24.96 
25-25 


25-54 
25:83 
26-12 
26-40 
26-68 


26:96 


12-15 
12-77 
13-35 
13-92 
14-46 


14-98 
15:49 


12-55 
13-18 
13-78 
14-36 
14-92 


15-46 
15-98 
16:49 
16:99 
17-47 


17-93 
18-39 
18-84 
19:27 
19-70 


20-12 
20-53 
20:93 
21-33 
21.72 


22-10 
22-47 
22.84 
23:21 
23-57 


23-92 
24-27 
24-61 
24-95 
25.28 


25-62 
25-94 
26:26 
26-58 
26-90 


27-21 
27-52 
27-82 
28-12 
28-42 


28-72 


12-93 
13-58 
14.20 
14-80 
15-37 


15:93 
16:46 
16-99 
17:49 
17:99 


18:47 
18:94 
19:40 
19:84 
20:28 


20-71 
21-14 
21-55 
21-96 
22-36 


22-75 
23-13 
23:52 
23-89 
24-26 


24-62 
24-98 
25.33 
25:68 
26-03 


26:37 
26:70 
27-03 
27-36 
27-68 


28-01 
28-32 
28-63 
28-94 
29-25 


29-56 


13-30 
13-97 
14-60 
15-22 
15-81 


16:38 
16:93 
17:47 
17:99 
18:49 


18:99 
19:47 
19-94 
20:40 
20-85 


21:29 
21-73 
52.15 
22-57 
22-98 


23:38 
23-78 
24-17 
24-55 
24.93 


25:31 
25:67 
26-04 
26:39 
26-75 


27-10 
27-44 
27-78 
28-12 
28-45 


28:78 
29-11 
29:43 
29-75 
30-06 


30:37 
17-17 


18-83 
19-66 
20-36 


21-09 
21-79 
22-47 
23-13 
23-78 


24-41 
25-02 
25-62 
26-20 
26-78 


27-34 
27-89 
28-43 
28:96 
29:48 


29-99 
30:50 
30-99 
31-48 
31-96 


32-44 
32-90 
33:37 
33:82 
34.27 


3471 
35:15 
35:59 
36-01 
36:44 


36-86 
37:27 
37-68 
38:08 
38:48 


83:88 


17-74 
18-61 
19-45 
20-26 
21-03 


21-78 
22-50 
23-21 


23.89 2 


24-55 


25-20 
25-83 
26-45 
27-05 
27-64 


28-22 
28-79 
29:35 
29:89 
30:43 


30:96 
31:48 
31:99 
32-50 
32.99 


33:48 
33-96 
34-44 
34-91 
35:37 


35:83 
36:28 
36:73 
37-17 
37-61 


38-04 
38:47 
38:89 
39-31 
39-72 


40-13 


сочно 


19-08 
20:02 
20:92 
21-79 
22-62 


23:42 
24-20 
24-95 


25.08 2 


26:39 


27-09 
27-77 
28-43 
29:07 
29:71 


30:33 
30-94 
31:53 
32:12 
32-70 


33-26 
33-82 
34-37 
34-91 
35-44 


35:97 
36:48 
36:99 
37-50 
37.99 


38:48 
38:97 
39-45 
39-92 
40:39 


40-85 
41:31 
41:76 
42-21 
42-65 


43-09 
43 


45 


46 


47 48 


49 


19:60 
20:56 


22-37 
23-22 


24-05 
24-84 
25:62, 
26:37 
27.10 


27.81 
28-50 
29-18 
29:85 
30-50 


31:13 
31:76 
32.37 
32:97 
33-56 


34-14 
34:71 
35:28 
35:83 
36:38 


36:91 
37-44 
37:97 
38-48 
38:99 


39-50 
39:99 
40:48 
40:97 
41-45 


41-92 
42.39 
42-86 
43-32 
43:77 


44:49 


20:82 
21:85 
22:83 
23-77 
24-67 


25-54 
26-39 
27:21 
28-01 
28-78 


29-53 
30-27 
30:99 
31-69 
32-38 


33-06 
33-72 
34:37 
35:01 
35-03 


36-25 
36:86 
37:45 
38-04 
38:62 


39-19 
39-75 
40.30 
40:85 
41-39 


41-92 
42:45 
42:97 
43:49 
43-99 


44-50 
44-99 
45:49 
45:97 
46-46 


46-93 


21-06 
22-10 
23-08 
24-04 
24-95 


25.83 
26-69 
27.52 
28.32 
29-11 


29-87 
30.61 
31.34 
32.05 
32.75 


33:43 
34:10 
34:76 
35-40 
36-03 


36-66 
37.27 
37.87 
38-47 
39:05 


39-63 
40:19 
40:76 
41:31 
41:85 


42:39 
42-92 
43:45 
43-97 
44-49 


44-99 
45-50 
45-99 
46:49 
46:97 


47:46 


21:30 21:53 
22-34 22-59 
23:34 23:60 
24-30 24-57 
25:23 25-50 


26-12 26-40 
26:98 27.28 
27.82 28-12 
28:63 28:94 
29:43 29-75 


30-20 30:52 
30-95 31:29 
31.69 32:03 
32:41 32-76 
33-11 33:47 


33:80 34-16 
34.47 34-85 
35:14 35:52 
35-79 36:17 
36-43 - 30-82 


37-06 37-46 
37-68 38:08 
38:29 38-70 
38:89 39:31 
39:46 39:90 


40-06 40-49 
40:64 41-07 
41-20 41-64 
41-76 42-21 
42.31 42.77 


42-86 43-32 
43-40 43:86 
43-92 44-40 
44-45 44-03 
44-97 45-45 


45:49 45-97 


45:99 46:49" 


46-50 46-99 
46-99 47-50 
47-49 47. 9 


£ 
47-97 48:49 


21-76 
22-83 
23:85 
24-83 
25:77 


26-68 
27-57 
28-42 
29-20 
30-06 


30-85 
31-62 
32.37 
33-10 
33-82 


34:52 
35:21 
35-89 
36-56 
37-21 


37-85 
38:48 
39-11 
39-72 
40:32 


40:92 
41-50 
42.08 
42.65 | 
43:22 


43:77 
44:32 
44-86 
45:40 
45-93 


46-46 | 


46:97 | > 


47-49) 
47-99 
48:50 


48:99: 
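
The entries of the preceding table can be regenerated rather than read off. The sketch below assumes that the entry for two sample sizes n1 and n2 is √[n1·n2·(n1 + n2 - 2)/(n1 + n2)]; this reading is an inference from the surviving figures, and it reproduces, for example, 9.98 for (11, 10), 10.89 for (13, 10) and 23.06 for (11, 50). The function name `table_entry` is illustrative.

```python
from math import sqrt


def table_entry(n1, n2):
    """Assumed tabulated quantity: sqrt(n1 * n2 * (n1 + n2 - 2) / (n1 + n2))."""
    return sqrt(n1 * n2 * (n1 + n2 - 2) / (n1 + n2))


if __name__ == "__main__":
    # Print one column of the table: n1 fixed at 11, n2 running from 10 to 50.
    for n2 in range(10, 51):
        print(n2, f"{table_entry(11, n2):.2f}")
```

The expression is symmetric in n1 and n2, so a value missing from one column can be recovered from the column for the other sample size.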
Curve showing values of N necessary for
different values of r to be significant

[Curve not reproduced: N plotted against r, with r running from 0.12 to 0.26.]

If a calculated r from a sample of N observations is larger than the
value given by the curve corresponding to that value of N, it is
significant. Conversely, if the number of observations, N, in a sample
yielding a particular value of r is larger than that given by the curve
corresponding to that value of r, then r is significant.
In borderline cases the standard error of r should always be
calculated as a check.
APPENDIX C
Table of values of $6/[N(N^2 - 1)]$ for different values of N

The required values are the reciprocals of the numbers below.

 N             N             N
10     165    30    4495    50   20825
11     220    31    4960    51   22100
12     286    32    5456    52   23426
13     364    33    5984    53   24804
14     455    34    6545    54   26235
15     560    35    7140    55   27720
16     680    36    7770    56   29260
17     816    37    8436    57   30856
18     969    38    9139    58   32509
19    1140    39    9880    59   34220
20    1330    40   10660    60   35990
21    1540    41   11480    61   37820
22    1771    42   12341    62   39711
23    2024    43   13244    63   41664
24    2300    44   14190    64   43680
25    2600    45   15180    65   45760
26    2925    46   16215    66   47905
27    3276    47   17296    67   50116
28    3654    48   18424    68   52394
29    4060    49   19600    69   54740


To obtain $\rho$ by the use of the above table, divide $\Sigma(d^2)$ by the
number in the table opposite the appropriate value of N and subtract
the answer from 1.
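
The rule just stated can be sketched in a few lines. The tabulated number opposite N is N(N² - 1)/6, so dividing Σ(d²) by it and subtracting from 1 gives ρ. The function names below are illustrative, and the sketch assumes two complete rankings of the same N items with no ties.

```python
def appendix_c_value(n):
    """Tabulated number opposite N, i.e. N(N^2 - 1)/6 (always a whole number)."""
    return n * (n * n - 1) // 6


def spearman_rho(ranks_x, ranks_y):
    """Rank correlation: 1 minus sum of squared rank differences over the table value."""
    n = len(ranks_x)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(ranks_x, ranks_y))
    return 1 - sum_d2 / appendix_c_value(n)


if __name__ == "__main__":
    first = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    second = [2, 1, 4, 3, 6, 5, 8, 7, 10, 9]
    print(appendix_c_value(10), spearman_rho(first, second))
```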


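APPENDIX D

Squares of numbers from 1 to 99

No.  Square    No.  Square    No.  Square    No.  Square
 1       1     26     676     51    2601     76    5776
 2       4     27     729     52    2704     77    5929
 3       9     28     784     53    2809     78    6084
 4      16     29     841     54    2916     79    6241
 5      25     30     900     55    3025     80    6400
 6      36     31     961     56    3136     81    6561
 7      49     32    1024     57    3249     82    6724
 8      64     33    1089     58    3364     83    6889
 9      81     34    1156     59    3481     84    7056
10     100     35    1225     60    3600     85    7225
11     121     36    1296     61    3721     86    7396
12     144     37    1369     62    3844     87    7569
13     169     38    1444     63    3969     88    7744
14     196     39    1521     64    4096     89    7921
15     225     40    1600     65    4225     90    8100
16     256     41    1681     66    4356     91    8281
17     289     42    1764     67    4489     92    8464
18     324     43    1849     68    4624     93    8649
19     361     44    1936     69    4761     94    8836
20     400     45    2025     70    4900     95    9025
21     441     46    2116     71    5041     96    9216
22     484     47    2209     72    5184     97    9409
23     529     48    2304     73    5329     98    9604
24     576     49    2401     74    5476     99    9801
25     625     50    2500     75    5625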
APPENDIX E
Numerical material for exercises


The following two pages of figures give the scores of 100 subjects
in certain psychological tests. The scores in seven such tests are given
in columns A to G inclusive. Column H gives the ranked order of
merit of the subjects in a scholastic examination, and column I gives
an assessment of the practical ability of the subjects in some handicraft.
This assessment is in four categories: Very Good (V.G.), Good
(G.), Fair (F.) and Poor (P.).

These two pages of scores provide the basis for most of the exercises
at the end of the successive chapters of the text. The student who
wishes for additional exercises may easily make up others for himself
from the figures given.


Subject 
1 


A 




D Е 
45 121 
30 145 
5 108 
25 141 
27 96 
33 154 
41 102 
62 156 
24 147 
37 124 
23 98 
15 92 
35 170 
38 126 
19 127 
11 131 
20 111 
40 105 
31 122 
30 100 
27 152 
16 125 
38 116 
29 125 
15 131 
10 94 
42 80 
36 109 
32 114 
40 158 
10 94 
28 137 
53 127 
35 137 
28 107 
43 121 
37 142 
35 141 
30 115 
49 121 
43 106 
43 109 
20 129 
59 133 
25 107 
18 149 
30 158 
43 90 
- 15 141 
35 101 




Subject А B а D E Е а H I 
51 па 7962 А rol вла “25 * "1 P. 
52 ПОО 7) Ш о CA 40, УС 
53 51 247 а 10, 130 23 407 41 F. 
54 389 2058 Во БИ Ц, м bl 49. Уус 
55 58 266 22 26 126 19 26 42 IE 
56 4019210) 3347" 125 141 23 39 19 G 
57 59 139 Б отлов 9 22. 30; 40. F 
58 31 я 27, 38 142 24 35 21 P 
59 40 178 17 7 84 24 33 91 Р, 
60 БОО 120* 95 Кав 24° 50 168 G 
61 841 150. 27 98 907 gy 19) 27 499) Е 
БИИ 216 36° 34 4157 89) 37 9 VG. 
63 39 195 24 1 151 2925 29 "à Р 
64 ОТВ 25 i 99 2 25 88 Е. 
Ga О 1718 №27 21 146 81 30 м G. 
66 857 AET 48 e013) 162) * "38. - 44 2 Е 


67 LOR о БО о —39 63 VG 




77 45 244 14 34 113 25 39 66 Е. 


83 40200273111 7377 07147, SSM 408 1525 а. 




90 43. 187 33 58 152 41 53 14 V.G. 




Values of $n(n - 1)(2n + 5)$ for different values of n

 n   n(n-1)(2n+5)     n   n(n-1)(2n+5)     n   n(n-1)(2n+5)
 2        18         22      22638        42     153258
 3        66         23      25806        43     164346
 4       156         24      29256        44     175956
 5       300         25      33000        45     188100
 6       510         26      37050        46     200790
 7       798         27      41418        47     214038
 8      1176         28      46116        48     227856
 9      1656         29      51156        49     242256
10      2250         30      56550        50     257250
11      2970         31      62310        51     272850
12      3828         32      68448        52     289068
13      4836         33      74976        53     305916
14      6006         34      81906        54     323406
15      7350         35      89250        55     341550
16      8880         36      97020        56     360360
17     10608         37     105228        57     379848
18     12546         38     113886        58     400026
19     14706         39     123006        59     420906
20     17100         40     132600        60     442500
21     19740         41     142680


This table may be used in calculating the significance of $\tau$.
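
As an illustration of how the tabulated quantity n(n - 1)(2n + 5) may enter a significance test of τ, the sketch below assumes the usual large-sample result that, for untied rankings of n items with no real association, the score S (concordant pairs minus discordant pairs) has variance n(n - 1)(2n + 5)/18. That result and the function names are assumptions of this sketch; no continuity correction is applied.

```python
from math import sqrt


def appendix_f_value(n):
    """Tabulated quantity n(n - 1)(2n + 5)."""
    return n * (n - 1) * (2 * n + 5)


def kendall_score(ranks_x, ranks_y):
    """S = concordant pairs minus discordant pairs (rankings assumed untied)."""
    n = len(ranks_x)
    s = 0
    for i in range(n):
        for j in range(i + 1, n):
            product = (ranks_x[i] - ranks_x[j]) * (ranks_y[i] - ranks_y[j])
            s += (product > 0) - (product < 0)
    return s


def tau_and_deviate(ranks_x, ranks_y):
    """Return (tau, S / sd(S)), with sd(S) = sqrt(table value / 18) assumed."""
    n = len(ranks_x)
    s = kendall_score(ranks_x, ranks_y)
    tau = s / (n * (n - 1) / 2)
    deviate = s / sqrt(appendix_f_value(n) / 18)
    return tau, deviate


if __name__ == "__main__":
    print(tau_and_deviate([1, 2, 3, 4, 5, 6], [2, 1, 3, 5, 4, 6]))
```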
APPENDIX G
Values of $t(t - 1)(t - 2)$ for different values of t


 t   t(t-1)(t-2)     t   t(t-1)(t-2)     t   t(t-1)(t-2)
 3         6        19      5814        35      39270
 4        24        20      6840        36      42840
 5        60        21      7980        37      46620
 6       120        22      9240        38      50616
 7       210        23     10626        39      54834
 8       336        24     12144        40      59280
 9       504        25     13800        41      63960
10       720        26     15600        42      68880
11       990        27     17550        43      74046
12      1320        28     19656        44      79464
13      1716        29     21924        45      85140
14      2184        30     24360        46      91080
15      2730        31     26970        47      97290
16      3360        32     29760        48     103776
17      4080        33     32736        49     110544
18      4896        34     35904        50     117600


This table may be used for calculating the significance of $\tau$ when
there are tied rankings.
APPENDIX H
Values of $mn(m - 1)(n - 1)/4$ for different values of n and m


 n    m = 2   m = 3   m = 4   m = 5   m = 6
 3       3       9      18      30      45
 4       6      18      36      60      90
 5      10      30      60     100     150
 6      15      45      90     150     225
 7      21      63     126     210     315
 8      28      84     168     280     420
 9      36     108     216     360     540
10      45     135     270     450     675
11      55     165     330     550     825
12      66     198     396     660     990
13      78     234     468     780    1170
14      91     273     546     910    1365
15     105     315     630    1050    1575


To calculate u, divide $2\Sigma$ by the appropriate number above and
subtract 1.
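
The rule for u can be sketched as follows. The sketch assumes the usual definition of Σ for paired comparisons: if m judges compare n objects in pairs and γ judges prefer object i to object j, then Σ is the sum of γ(γ - 1)/2 over every ordered pair (i, j). That definition, the layout of `pref_counts` and the function names are assumptions made for illustration.

```python
from math import comb


def appendix_h_value(n, m):
    """Tabulated quantity mn(m - 1)(n - 1)/4 (always a whole number)."""
    return m * n * (m - 1) * (n - 1) // 4


def coefficient_of_agreement(pref_counts, m):
    """u = 2 * Sigma / table value - 1.

    pref_counts[i][j] is the number of the m judges preferring object i
    to object j; diagonal cells are ignored.
    """
    n = len(pref_counts)
    sigma = sum(comb(pref_counts[i][j], 2)
                for i in range(n) for j in range(n) if i != j)
    return 2 * sigma / appendix_h_value(n, m) - 1


if __name__ == "__main__":
    # Three judges in complete agreement on the order of three objects.
    unanimous = [[0, 3, 3],
                 [0, 0, 3],
                 [0, 0, 0]]
    print(coefficient_of_agreement(unanimous, m=3))
```

Complete agreement gives u = 1, since every pair then contributes m(m - 1)/2 to Σ.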


APPENDIX I


Values of $\nu$ for different values of n and m


 n    m = 3   m = 4    m = 5     m = 6
 3      18       9      6.6̇       5.625
 4      36      18     13.3̇      11.25
 5      60      30     22.2̇      18.75
 6      90      45     33.3̇      28.125
 7     126      63     46.6̇      39.375
 8     168      84     62.2̇      52.5
 9     216     108     80        67.5
10     270     135    100        84.375
11     330     165    122.2̇     103.125
12     396     198    146.6̇     123.75
13     468     234    173.3̇     146.25
14     546     273    202.2̇     170.625
15     630     315    233.3̇     196.875


This table may be used in the testing of the significance of u.

APPENDIX J
Values of $\frac{1}{2}\binom{n}{2}\binom{m}{2}\frac{m - 3}{m - 2}$* for different values of n and m

 n    m = 4    m = 5     m = 6
 2     1.5      3.3̇       5.625
 3     4.5     10        16.875
 4     9       20        33.75
 5    15       33.3̇      56.25
 6    22.5     50        84.375
 7    31.5     70       118.125
 8    42       93.3̇     157.5
 9    54      120       202.5
10    67.5    150       253.125
11    82.5    183.3̇     309.375
12    99      220       371.25
13   117      260       438.75
14   136.5    303.3̇     511.875
15   157.5    350       590.625


This table may be used in calculating the significance of u.


* For m = 3 this expression equals 0 for all values of n.
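
The two preceding tables can be regenerated, and combined into a test of u, along the following lines. The sketch assumes that ν equals C(n, 2)·m(m - 1)/(m - 2)², which reproduces the Appendix I entries, and that χ² = [4/(m - 2)]·(Σ - J), where J is the Appendix J entry and Σ is the sum used in calculating u; the χ² formula is an assumption of this sketch and is not stated in the tables themselves. The deviate √(2χ²) - √(2ν - 1) is the quantity quoted in the answer to Exercise 36. Function names are illustrative, and m must exceed 2.

```python
from math import comb, sqrt


def appendix_i_nu(n, m):
    """Degrees of freedom nu, assumed to be C(n,2) * m(m-1) / (m-2)^2."""
    return comb(n, 2) * m * (m - 1) / (m - 2) ** 2


def appendix_j_value(n, m):
    """Tabulated quantity (1/2) * C(n,2) * C(m,2) * (m-3)/(m-2)."""
    return 0.5 * comb(n, 2) * comb(m, 2) * (m - 3) / (m - 2)


def chi_square_for_u(sigma, n, m):
    """Assumed form: chi-squared = 4/(m-2) * (Sigma - Appendix J value)."""
    return 4 / (m - 2) * (sigma - appendix_j_value(n, m))


def normal_deviate(chi_sq, nu):
    """sqrt(2 * chi-squared) - sqrt(2 * nu - 1), for use when nu is large."""
    return sqrt(2 * chi_sq) - sqrt(2 * nu - 1)


if __name__ == "__main__":
    print(appendix_i_nu(7, 6), appendix_j_value(7, 6))
```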


APPENDIX K


Values of $m^2(n^3 - n)/12$ for different values of n and m


 n    m = 3     m = 4    m = 5     m = 6
 3      18        32       50        72
 4      45        80      125       180
 5      90       160      250       360
 6     157.5     280      437.5     630
 7     252       448      700      1008
 8     378       672     1050      1512
 9     540       960     1500      2160
10     742.5    1320     2062.5    2970
11     990      1760     2750      3960
12    1287      2288     3575      5148
13    1638      2912     4550      6552
14    2047.5    3640     5687.5    8190
15    2520      4480     7000     10080


To find W, divide S by the appropriate number in the above table.
If there are ties, divide S by the appropriate number from the table
less $m\Sigma(T)$.
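
The rule for W can be sketched as follows. The sketch assumes that S is the sum of squared deviations of the n rank totals from their mean, m(n + 1)/2, where m judges each rank n objects; the correction Σ(T) for ties is not computed here and is simply passed in by the caller. The function names and the parameter `tie_correction` are illustrative.

```python
def appendix_k_value(n, m):
    """Tabulated quantity m^2 (n^3 - n) / 12."""
    return m ** 2 * (n ** 3 - n) / 12


def coefficient_of_concordance(rankings, tie_correction=0.0):
    """W = S / (table value - m * tie_correction).

    rankings is a list of m rankings, each a list of n ranks.
    tie_correction stands for Sigma(T); leave it at 0 when there are no ties.
    """
    m = len(rankings)
    n = len(rankings[0])
    totals = [sum(ranking[i] for ranking in rankings) for i in range(n)]
    mean_total = m * (n + 1) / 2
    s = sum((total - mean_total) ** 2 for total in totals)
    return s / (appendix_k_value(n, m) - m * tie_correction)


if __name__ == "__main__":
    judges = [[1, 2, 3, 4], [2, 1, 3, 4], [1, 3, 2, 4]]
    print(coefficient_of_concordance(judges))
```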